




THE JOURNAL OF
EXPERIMENTAL EDUCATION





Volume I    March, 1933    Number 3














CONTENTS 


Experimental Procedures in Test Evaluation: Lindquist, E. F., and
Cook, Walter W.

The Interpretation of the Coefficient of Correlation: Monroe, Walter S.,
and Stuit, Dewey B.

On the Accuracy with Which Reliability May Be Measured by Correlating
Test Halves: Brownell, William A.

Need for Standardization of Symbols and Formulae in Educational Statistics:
West, Paul V.

Standardization of Statistical Symbolism: Monroe, Walter S.

Some Considerations Relative to the Standardization of Certain Procedures
in Educational Research: Toops, Herbert A.

Simplified Schemas for Multiple Linear Correlation: Griffin, Harold D.

A Frequency-Product Method of Obtaining the Standard Deviation: Toops,
Herbert A.

Validation Against a Fallible Criterion: Cureton, Edward E.

The Differential Predictive Value of the Psychological Examination of the
American Council on Education: Watts, J. Virgil

On the Limits of Predicting Scholastic Success: Easley, Howard

The Validity of Certain Prognostic Tests in Predicting Algebraic Ability:
Torgerson, T. L., and Aamodt, Geneva P.

A Natural Test of English Usage: Beck, Roland L.

Preliminary Report on the Stanford Binet IQ Changes of Superior Children:
Lincoln, E. A.











PUBLISHED QUARTERLY 








EDWARDS BROTHERS, INC. 
LITHOPRINTERS AND PUBLISHERS    ANN ARBOR, MICHIGAN
Application filed for entry as second class matter. 
Printed in U. S. A. 






















JOURNAL OF EXPERIMENTAL EDUCATION 


EDITORIAL BOARD 
A. S. Barr, Chairman, Professor of Education, University of Wisconsin, Madison, Wisconsin. 


Carter V. Good, Professor of Education, Uni- 
versity of Cincinnati, Cincinnati, Ohio. Edi- 
torially responsible for materials on super- 
vision and psychology of learning and teach- 
ing. 

Henry Harap, Associate Professor of Educa- 
tion, Western Reserve University, Cleveland, 
Ohio. Editorially responsible for materials on 
experimental studies of curriculum construc- 
tion. 


Walter S. Monroe, Professor of Education, Uni-
versity of Illinois, Urbana, Illinois. Editori-
ally responsible for materials on measure-
ments, statistics, and methods of experimen-
tal research.


George D. Stoddard, Director, Child Welfare 
Research Station, State University of Iowa, 
Iowa City, Iowa. Editorially responsible for 
materials on child welfare, guidance, and de- 
velopment. 


CONTRIBUTING EDITORS 


Harry J. Baker, Director, Psychological Clinic, Detroit 
Public Schools, Detroit, Michigan. 

W. E. Blatz, Nursery School Division, University of 
Toronto, Toronto 5, Canada. 

William F. Book, Head, Department of Psychology and 
Philosophy, Indiana University, Bloomington, Indiana.

Fowler D. Brooks, Head, Departments of Education 
and Psychology, DePauw University, Greencastle, 
Indiana. 

William A. Brownell, Professor of Educational Psy-
chology, Duke University, Durham, North Carolina.

Leo J. Brueckner, Professor of Education, University of
Minnesota, Minneapolis, Minnesota.

Herbert B. Bruner, Professor of Education, Teachers 
College, Columbia University, New York City. 

Barbara S. Burks, Psychologist, Institute of Child Wel- 
fare, University of California, Berkeley, California. 

Otis W. Caldwell, Professor of Education, Teachers 
College, Columbia University, New York City. 

Ellsworth Collings, Dean, College of Education, Univer- 
sity of Oklahoma, Norman, Oklahoma. 

Philip W. L. Cox, Professor of Secondary Education, 
New York University, New York City. 

Edgar A. Doll, Director of Research, Training School, 
Vineland, New Jersey. 

Harl R. Douglass, Professor of Education, University 
of Minnesota, Minneapolis, Minnesota. 

Jack W. Dunlap, Associate Professor of Education, 
Fordham University, New York City. 

Paul H. Furfey, Professor of Psychology, Catholic Uni-
versity of America, Washington, D. C.

Florence L. Goodenough, Professor, Institute of Child 
Welfare, University of Minnesota, Minneapolis, Minn. 

M. E. Herriott, Assistant Director, Division of Psychol-
ogy and Educational Research, Public Schools, Los
Angeles, California. 

Karl J. Holzinger, Professor of Education, University 
of Chicago, Chicago, Illinois. 

L. Thomas Hopkins, Curriculum Specialist, Lincoln 
School of Teachers College, 425 West 123rd Street, 
New York City. 

C. L. Huffaker, Professor of Education, Bureau of Edu- 
cational Research, University of Oregon, Eugene, Ore. 

Kai Jensen, Assistant Professor of Education, Univer- 
sity of Wisconsin, Madison, Wisconsin. 

Harold E. Jones, Director of Research, Institute of 
Child Welfare, University of California, Berkeley, 
California. 

Edward A. Lincoln, Assistant Professor of Education, 
Graduate School of Education, Harvard University, 
Cambridge, Massachusetts. 

E. F. Lindquist, Professor of Education, State Univer-
sity of Iowa, Iowa City, Iowa.


Lois Hayden Meek, Professor of Education, Director,
Child Development Institute, Teachers College, 
Columbia University, New York City. 

C. W. Odell, Associate Professor of Education, Univer- 
sity of Illinois, Urbana, Illinois. 

Willard C. Olson, Director of Research in Child Devel- 
opment, University of Michigan, Ann Arbor, Mich. 


W. E. Peik, Associate Professor of Education, Univer- 
sity of Minnesota, Minneapolis, Minnesota. 

S. L. Pressey, Professor of Educational Psychology, 
Ohio State University, Columbus, Ohio. 

W. H. Pyle, Professor of Educational Psychology, De- 
troit Teachers College, Detroit, Michigan. 

Paul T. Rankin, Supervising Director, Board of Educa- 
tion, Detroit, Michigan. 

Clarence E. Ragsdale, Assistant Professor of Education, 
University of Wisconsin, Madison, Wisconsin. 

H. H. Remmers, Director, Division of Educational Ref- 
erence, Professor of Education and Psychology, Pur- 
due University, Lafayette, Indiana. 

G. M. Ruch, Professor of Education, University of Cal- 
ifornia, Berkeley, California. 

Earl U. Rugg, Head, Department of Education, Colo- 
rado State Teachers College, Greeley, Colorado. 

Peter Sandiford, Professor of Educational Psychology, 
Director, Educational Research, University of Tor- 
onto, Toronto, Canada. 

Douglas E. Scates, Director, School Research, Cincin- 
nati Public Schools, Cincinnati, Ohio. 

Raleigh Schorling, Professor of Education, Supervisor, 
Directed Teaching and Instruction in University High 
School, University of Michigan, Ann Arbor, Michigan. 

Mandel Sherman, Associate Professor of Education, 
University of Chicago, Chicago, Illinois. 

Helen Thompson, Research Associate, Yale University, 
New Haven, Connecticut. 

Herbert A. Toops, Professor of Psychology, Statistics 
and College Personnel, Ohio State University, Colum- 
bus, Ohio. 

T. L. Torgerson, Assistant Professor of Education, Uni- 
versity of Wisconsin, Madison, Wisconsin. 

M. R. Trabue, Executive Secretary of Committee on 
Individual Diagnosis and Training, Institute for Em- 
ployment Stabilization Research, University of Min- 
nesota, Minneapolis, Minnesota. 

R. W. Tyler, Professor of Education, Ohio State Uni- 
versity, Columbus, Ohio. 

Douglas Waples, Graduate Library School, University 
of Chicago, Chicago, Illinois. 

Beth L. Wellman, Research Associate Professor, Child
Welfare Research Station, State University of Iowa, 
Iowa City, Iowa. 







EXPERIMENTAL PROCEDURES IN TEST EVALUATION 
E. F. Lindquist 
State University of Iowa 


and 


Walter W. Cook 
Eastern Illinois State Teachers College 


I. OPTIMUM ADMINISTRATION TIME 


Both the reliability and the validity of a given achievement test of
the pencil-and-paper type are very often largely a function of the time
in which the test is administered, i.e., the time allowed the pupils to
complete the test. Obvious as this fact may seem, it is, nevertheless,
one which has been disregarded in nearly all research into the relative
effectiveness of test techniques. So grievous is this oversight that, in
the opinion of the authors, it renders relatively inconclusive the
results of nearly all such researches thus far published. In support of
this contention, it is the purpose of this article to discuss the
influence of administration time upon test validity and reliability, to
introduce and define the concept of optimum administration time, and to
suggest a technique for determining this optimum time for any given body
of test material.


The usual procedure in research intended to determine the relative
effectiveness of two techniques for measuring the same ability may be
illustrated by the following hypothetical situation, which may be found
duplicated in all its essential characteristics in published articles in
almost any field of objective test construction. Let us suppose the
experimenter desires to determine which of two ways of testing spelling
ability, e.g., the multiple-choice and right-wrong types of recognition
tests, is the more valid and reliable. For this purpose he constructs
one test of each type, using the same set of basic words in each. After
a criterion measure has been secured for a given group of pupils, these
two forms are then administered to the same group under controlled
conditions, but often in quite arbitrarily determined administration
times for each form. Suppose, however, that through preliminary
experimentation it has been found that ten minutes are required for
eighty per cent of a sample of pupils to complete the right-wrong test,
while fifteen minutes are required for the same proportion to complete
the multiple-choice form. In the controlled experiment, then, the two
forms are administered at these predetermined times.¹ Let us suppose
that, when administered in ten minutes, the right-wrong form yields a
reliability of .72, while the multiple-choice form, when administered in
fifteen minutes, yields a reliability of .80. The reliability for the
first form is then "stepped up," by means of the Spearman-Brown Prophecy
Formula, to the same time limit as was used for the longer form. In this
case the "stepped up" reliability for the first form would be .83. The
critical comparison is then made in terms of reliability coefficients
for equal administration times. (A similar procedure could be followed
for validity coefficients, using techniques later discussed in this
article, but this has rarely if ever been done in past research.)

1. The time may be chosen as that at which any other given per cent of
the group complete the test, including one hundred per cent. In any
case, however, an arbitrary decision is made by the experimenter.
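The "stepping up" described in this illustration can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the function name `spearman_brown` is hypothetical, and the lengthening factor is taken to be the ratio of the two time limits (15/10), which is one reading of the passage rather than something the text states outright.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Spearman-Brown prophecy: reliability of a test lengthened by `length_factor`."""
    return (length_factor * reliability) / (1.0 + (length_factor - 1.0) * reliability)


# Figures from the hypothetical illustration: the right-wrong form
# (ten minutes, reliability .72) is projected to the fifteen-minute
# limit used for the multiple-choice form.
factor = 15 / 10  # assumed lengthening factor (ratio of time limits)
stepped_up = spearman_brown(0.72, factor)
print(round(stepped_up, 2))
```

With a factor of 1.5 the formula gives roughly .79 rather than the .83 printed in the illustration, so the printed figure may rest on a different lengthening factor or may be a misprint; the argument of the passage does not depend on the exact value.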

The latter part of this procedure appears to take the time factor into
consideration, but actually it disregards it at the most critical point.
There is no conclusive evidence that the best time for administering a
test is that which is required for completion by eighty per cent (or any
other per cent) of the pupils, nor is there any evidence that this
proportion (if it is a valid determiner) does not vary for different
types of tests. In the last analysis, then, under the type of procedure
described, the time limits for each test are arbitrarily determined. The
superiority, in either reliability or validity, thus determined for
either form might therefore be reversed if other, and equally arbitrary,
time limits were chosen.

It should be clear, in light of this 
argument, that no valid comparison can be 
made between two test forms until an objec-
tive method has been devised for controlling
this time factor, i.e., of determining,
through preliminary experimentation, the time 
in which each form should be administered. 
Furthermore, this time should in each case 
be the optimum, that which is most favorable 
to the form in question. 

The problem of the optimum time in which 
to administer a given amount of test mate- 
rial can perhaps best be approached by con- 
sidering first the related question of the 
optimum amount of material (of a given type 
and quality) to administer in a given period 
of time. The answer to this latter question 
is so obvious as to require no defense. The 
optimum amount of material to administer in 
a given time limit is that which will yield 
the maximum validity in that time limit. 
This could readily be experimentally deter- 
mined for any type and quality of test items 
and for any given time period by successive-
ly determining the validity of different
amounts of material, each administered in the
given time. That a definite optimum would
always exist is hardly questionable. More
material than the optimum would result in a
lower validity due primarily to too much em-
phasis on the speed factor, while less mate-
rial than the optimum would result in lower
validity due to the more limited sampling.

If for any given time period there exists an optimum amount of test
material of a given type, then for that amount of material the given
time must also be considered as the optimum. That this converse
relationship holds is not as obvious as it may seem. It may be true,
e.g., that for a given group of pupils and for a given spelling test
technique, the optimum amount of material for a fifteen-minute testing
period is eighty-five items, i.e., the validity of the total scores
obtained would be lowered if either more or fewer than eighty-five items
were administered in the fifteen-minute period. It does not follow,
however, that the given test of eighty-five items will yield the maximum
validity when administered in fifteen minutes. It may be, for instance,
that if the same test is administered in twenty minutes a still higher
total validity will be secured. An administration period of fifteen
minutes would nevertheless be the optimum for the eighty-five item test,
since it is in that length of time that the highest validity is secured
per unit of time. To increase the time beyond this period would result
in a lower validity than could otherwise be secured in the augmented
time by increasing the amount of material to the optimum for that time.

The optimum time of administration for 
any given test may then be defined as the
shortest time at which a greater increase in 
the validity of the obtained scores can be 
secured through the addition of more (homo- 
geneous) material with a proportionate in- 
crease of time than by permitting more time 
to be spent on the same material. 

Since this definition is suggestive of 
the technique to be described later, and al- 
so since it is somewhat involved, it may 
warrant some further explanatory discussion. 
Suppose that a self-administering spelling 
test of fifty items is so timed that only a
very small proportion of the group tested is 
able to complete the test. Let us say that 
this time is five minutes. For this adminis- 
tration time, the test will yield a certain 
correlation with a criterion measure (valid-
ity) and a certain self-correlation (relia-
bility). Under the conditions described, it
is quite likely that a higher validity may
be secured for the same test by increasing 
the administration time to, say, seven and 
one-half minutes. It would, however, also 
be possible to secure an increased validity 
in the measures obtainable in this latter 
amount of time (seven and one-half minutes) 
by keeping the rate or time per item the 
same, and by administering seventy-five i- 
tems (homogeneous with the original) instead 
of fifty. 

If the first of these alternatives were 
to yield a higher validity than the second, 
i.e., if in seven and one-half minutes a 
higher validity were secured from a test of 
fifty than from a test of seventy-five items, 
then of course seven and one-half minutes 
is inadequate for a seventy-five item test, 
and hence five minutes is inadequate for a 
fifty item test. This eventuality, in oth- 
er words, would set a lower limit to the op- 
timum time. 

If, however, the latter alternative 
were to yield the higher validity, i.e., if 
in seven and one-half minutes a higher va- 
lidity were secured from a test of seventy- 
five than from a test of fifty items, then 
of course, seven and one-half minutes is more 
than adequate for fifty items. This eventu- 
ality, then, would set an upper limit to the 
optimum time.

For any given administration time (be- 
low the optimum), therefore, there are two 
alternative procedures for securing an in-
crease in validity by increasing the time.
One involves increasing the time only and 
holding material (number of items) constant; 
the other involves increasing both material 
and time proportionately. Depending upon 
the length of the given administration time 
under consideration, one of these alterna- 
tives will result in greater validity for 
the augmented time than the other.

By trying both of these alternatives at 
each of a number of different administration 
times, it is possible to determine the upper 
and lower limits of the optimum time within 
as narrow a range as is desired. The opti- 
mum administration time for a given test, 
then, is that at which, from that time up- 
ward, greater increase in validity can be 
secured by increasing both time and material
proportionately than by simply allowing more
time to be spent on the same material. 

The experimental procedure for deter- 
mining the optimum administration time for a 
given test requires that scores be secured 
for various time limits from a random sample 
of pupils at the desired educational level, 
and that the validity and reliability coef- 
ficients be computed for each time limit. 
This may be done by administering several e- 
quivalent forms of the test to the same group 
of pupils, or by administering the same form 
to several equated groups, or by a third and 
more expedient procedure to be described la- 
ter. Experimental time limits must be set 
in such a way as to preclude the possibility 
that the optimum time will be outside their 
range, while the size of the interval between 
them will depend upon the accuracy with which 
it is desired to determine the optimum time. 

Suppose, for example, that each of four equivalent forms of a given
test of sixty items is administered in three, four, five, and six
minutes respectively, and that for these time limits the reliability and
validity coefficients are secured as follows:¹

Minutes of Time     Validity     Reliability
       3            .58 (A)       .91 (F)
       4            .69 (B)       .87 (G)
       5            .74 (C)       .84 (H)
       6            .75 (D)       .82 (I)


Now let us consider the first of these administration times (three
minutes). At this time, two alternatives, already described, present
themselves for securing additional validity by increasing the time to
four minutes. The result of one of these (simply increasing time but
holding material constant) is shown in the preceding table and in
Figure 1. When four minutes are allowed for the original sixty item
test, the validity is increased to .69, while the reliability is reduced
to .87.

1. These values, which are not hypothetical but are taken from actual
experimentation with spelling test material, are graphically represented
by the lettered points on the solid lines in Figure 1.

[Fig. 1. Validity and Reliability of a Given Test When Administered in
Different Time Limits. Horizontal axis: minutes of testing time. Solid
lines: obtained values; broken lines: predicted values.]

It would have been possible, however, while increasing the time from
three to four minutes, to have increased the amount of material in the
same proportion. That is, we might have increased the number of items
from sixty to eighty, while also increasing the time in proportion from
three to four minutes. The effects of this procedure upon reliability
and validity can be satisfactorily predicted without actual trial by
means of the Spearman-Brown Prophecy Formula and the following formula
provided by Kelley¹ for the correlation between a criterion and the sum
or average of a number of equally weighted scores:

$$V' = \frac{V\sqrt{N}}{\sqrt{1 + (N - 1)\,r_{11}}}$$

in which V is the validity and r₁₁ the reliability of the test at the
shorter time limit, V' is the predicted validity of a longer test of
homogeneous material administered at a proportionately longer time
limit, and N equals the ratio between the lengths of the two tests. In
the case of the illustration, V would be .58, r₁₁ would be .91, and N
would be 4/3 or 1.33. Applying this formula, we find that a test 1⅓
times as long as the given test, when administered in four minutes,
would yield a validity of .59. The increase in validity that would thus
be secured is shown in Figure 1 by the broken line AB', which in this
case lies below the solid line AB. It is clear in this case that the
alternative of allowing more time on the original test is the better.
That is, at point A a greater improvement in validity can be secured by
holding material constant and increasing time only, than by increasing
both the time and amount of material proportionately.
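The prediction for the three-to-four-minute step can be checked numerically. The sketch below assumes that Kelley's formula takes the form V' = V·√N / √(1 + (N − 1)·r₁₁); that form is a reconstruction, adopted here because it reproduces the .59 reported in the text, and the function name `predicted_validity` is hypothetical.

```python
import math


def predicted_validity(validity: float, reliability: float, length_factor: float) -> float:
    """Validity predicted for a homogeneous test lengthened by `length_factor`.

    Assumed form of Kelley's formula: V' = V * sqrt(N) / sqrt(1 + (N - 1) * r11).
    """
    return validity * math.sqrt(length_factor) / math.sqrt(
        1.0 + (length_factor - 1.0) * reliability
    )


# Values at the three-minute limit (point A): V = .58, r11 = .91, N = 4/3.
v_prime = predicted_validity(0.58, 0.91, 4 / 3)
print(round(v_prime, 2))  # 0.59
```

The predicted .59 falls well below the .69 obtained by simply allowing four minutes on the original sixty items, which is the comparison the text draws at point A.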

Having thus established three minutes as less than the optimum, let us
next consider the effect of increasing the time from four to five
minutes. If five minutes of time are allowed for the original sixty
items, the validity is increased to .74 (point C). If, however, while
increasing the time from four to five minutes, we also increase the
number of items in the same proportion, the validity will only be
increased to .70 (point C'). Again it is clear that the alternative of
simply allowing more time on the original material is the better. We
have thus established four minutes as less than the optimum.

We have now reached the point, however, at which further increase of
time on the original test will not result in as great an increase in
validity as may be secured by increasing both time and material
proportionately. An extension of time from five to six minutes on the
original test increases the validity to only .75 (point D), while if the
same increase in time is accompanied by a proportional increase in the
number of items, the validity may be increased to .76 (point D'). The
optimum time is thus established as definitely less than six minutes.

We have now set the lower and upper lim- 
its for the optimum time at four and six min- 
utes respectively. For practical purposes, 
we are therefore justified in accepting a 
five minute period as a close approximation 
to the true optimum. It is possible, with 
the given data, to interpolate along the 
smoothed curve to determine the optimum time 
more exactly, but considering the other sourc- 
es of error in an actual experiment of this 
type, such procedure would hardly be justi- 
fied. Obviously, the accuracy with which 
the optimum time may be determined by this 
method depends upon the size of the interval 
between experimental time limits, while the 
number of such time limits necessary to em-
ploy depends upon the ability of the experi-
menter to estimate the optimum time before-
hand.
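The step-by-step comparison just described can be written as a short routine. The sketch below is illustrative only: it uses the rounded table values, assumes the same reconstructed form of Kelley's formula, V' = V·√N / √(1 + (N − 1)·r₁₁), and the names `OBSERVED`, `predicted_validity`, and `bracket_optimum` are hypothetical.

```python
import math

# Table values from the text: minutes -> (validity, reliability).
OBSERVED = {3: (0.58, 0.91), 4: (0.69, 0.87), 5: (0.74, 0.84), 6: (0.75, 0.82)}


def predicted_validity(validity, reliability, length_factor):
    """Assumed form of Kelley's formula for a test lengthened by `length_factor`."""
    return validity * math.sqrt(length_factor) / math.sqrt(
        1.0 + (length_factor - 1.0) * reliability
    )


def bracket_optimum(observed):
    """Return (lower, upper) limits of the optimum time; None where a limit is not reached."""
    times = sorted(observed)
    lower, upper = None, None
    for shorter, longer in zip(times, times[1:]):
        validity, reliability = observed[shorter]
        # Alternative 1: more time on the same material (obtained value).
        same_material = observed[longer][0]
        # Alternative 2: time and material increased proportionately (predicted value).
        more_material = predicted_validity(validity, reliability, longer / shorter)
        if same_material > more_material:
            lower = shorter  # the shorter time is still below the optimum
        else:
            upper = longer  # the longer time is above the optimum
            break
    return lower, upper


print(bracket_optimum(OBSERVED))
```

With the rounded table values this returns limits of four and six minutes, in agreement with the text. The five-to-six-minute comparison is nearly a tie at two-decimal precision, so predicted values computed this way may differ in the last digit from those the authors read off their unrounded data.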





1. T. L. Kelley, Statistical Method (New York: The Macmillan Company, 1925), p. 200, Formula 152. 



















An obvious practical objection to the procedure already described is
that it requires that a number of equivalent forms be available for the
test whose optimum administration time is to be determined. The
alternative has been suggested of administering a single form to several
equated groups, but this procedure requires very large experimental
groups and often introduces serious difficulties in equating the groups
used. Another procedure, and one which has been successfully used by the
authors, involves the use of colored pencils with a single form and a
single group. Previous to the administration of the test, each pupil is
provided with four colored pencils (black, red, green, and blue), which
are placed on the desk before the pupil. The pupils are then instructed
to begin work on the test using the black pencil, working as rapidly but
as carefully as possible, taking the items in order. At the end of the
first time limit a command is given to change to the red pencil and
continue working as before. Similarly a change is made to the blue
pencil at the end of the second time limit, and to the green pencil at
the end of the third. The pupils are permitted to go back over the test
and make any changes they desire during the latter periods of the
testing time. The scores for the first time limit may then be summated
from the items marked in black, for the second time limit from the items
marked in black and red, etc.
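The summation of scores by pencil color amounts to a running total over the successive time limits. The following is a minimal sketch with made-up counts for one pupil; the names `PENCIL_ORDER` and `cumulative_scores` are hypothetical, and the sketch ignores the complication, noted in the text, that pupils may go back and change earlier answers.

```python
from itertools import accumulate

# Order in which the pencils are used, as described in the text.
PENCIL_ORDER = ["black", "red", "blue", "green"]


def cumulative_scores(correct_by_color):
    """Score at each successive time limit: items correct in all colors used so far."""
    per_period = [correct_by_color.get(color, 0) for color in PENCIL_ORDER]
    return list(accumulate(per_period))


# Hypothetical pupil: number of correct items marked in each color.
pupil = {"black": 18, "red": 9, "blue": 5, "green": 2}
print(cumulative_scores(pupil))  # [18, 27, 32, 34]
```

Each entry of the result is the score the pupil would have earned had the test been stopped at that time limit, which is what allows a single form and a single group to supply scores for several time limits.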

There are several theoretical objections to the colored pencil
technique, involving possible differences in mental set of the pupils
and in their procedure in writing the test as compared with the first
technique described. In an experiment by L. F. McDonough,¹ when the two
techniques were checked against each other, using intelligence test
material, a surprising correspondence in results was secured. Further
checks against the colored pencil technique, however, are desirable
before it is extensively adopted.

With one of the three techniques described, therefore, it is possible
to determine the optimum administration time of any test for which a
criterion of validity is available. In any experiment designed to
determine the relative validity of two ways of testing the same ability,
the time in which each form is administered in the controlled comparison
should be this optimum time, determined beforehand for each form through
preliminary experimentation. The coefficient of validity determined for
each form at its optimum administration time must then be used in
Kelley's formula to predict the validities of similar tests of equal
length, and the critical comparison is made in terms of these predicted
validities. In other words, the tests should be compared in terms of
validity for a standard period of testing time, when each test is
administered at its own optimum rate.
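The final comparison can be sketched as follows. All figures for the two forms are invented for illustration, the standard period of twenty minutes is an arbitrary choice, and the formula is the same assumed form of Kelley's formula, V' = V·√N / √(1 + (N − 1)·r₁₁); the function names are hypothetical.

```python
import math


def predicted_validity(validity, reliability, length_factor):
    """Assumed form of Kelley's formula for a test lengthened by `length_factor`."""
    return validity * math.sqrt(length_factor) / math.sqrt(
        1.0 + (length_factor - 1.0) * reliability
    )


def validity_at_standard_time(validity, reliability, optimum_minutes, standard_minutes):
    """Project a form's validity at its optimum time to a common testing period."""
    return predicted_validity(validity, reliability, standard_minutes / optimum_minutes)


# Hypothetical forms, each measured at its own optimum administration time.
form_a = validity_at_standard_time(0.74, 0.84, optimum_minutes=5, standard_minutes=20)
form_b = validity_at_standard_time(0.78, 0.88, optimum_minutes=8, standard_minutes=20)
print(round(form_a, 3), round(form_b, 3))
```

The form with the higher projected value would be judged the more valid for the standard period. Because each form enters the comparison at its own optimum rate, neither is penalized by an arbitrarily chosen time limit.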

While this concept of optimum adminis- 
tration time and the techniques described 
will find their principal application in re- 
search into the relative effectiveness of 
objective test forms, they also carry some 
important practical implications for achieve-
ment test construction and evaluation.

1. Leo Francis McDonough, An Evaluation of a [word illegible] for
Determining Optimum Time of Administration of a Test, Master's thesis,
State University of Iowa, Iowa City, Iowa.

Let us consider first the nature of the curves of obtained reliability
and validity coefficients (in Figure 1) when the same test is
administered in different times. Certain features of these curves are
probably characteristic of a large proportion of all achievement tests.
In general, short time limits will result in relatively high
reliability, with decreasing reliability for increases in time. In
general, also, short time limits will result in low validity, with
increasing validity for increases in time. This is probably due to the
changing nature of the function measured with changes in time allowed.
At the short time limit, speed is being emphasized rather than power; at
the longer time limits the effect of the speed factor is minimized. The
effect upon reliability may be accounted for by the fact that speed may
be more reliably measured than power. When a measure of power is desired
in achievement testing, it follows that the time conditions which
contribute to maximum reliability are those which are associated with
minimum validity. The implications of this fact for the practice of
evaluating test materials in terms of reliability coefficients are most
significant. High reliability is not to be considered as a desirable
characteristic of a test if it is secured at the expense of validity by
setting the time limits too short. It is quite possible that this
condition exists for many standardized achievement examinations now
available.

It is, furthermore, possible that the time limits set for standardized
tests now in use do not represent close approximations to their optimum
times. Further experimentation alone will disclose how serious the
situation may be. The effect of too short a time limit has already been
suggested. The effect of setting a time limit longer than the optimum
may be less serious, but is, nevertheless, undesirable. It simply means
that the test user is getting less return from the time employed in
testing than it is possible for him to secure. The standardized test
builder, as well as the experimenter with test techniques, is therefore
faced with the problem of providing adequately for this time factor. In
this latter case, however, the problem is of the type earlier discussed
in this article. When the standardized test builder is faced with the
necessity of building a test for a predetermined administration time,
set by the conditions of school administration, his problem is that of
determining the optimum amount of material to be administered in a given
amount of time, rather than the optimum amount of time in which to
administer a given amount of material.

It should be emphasized again, in closing this discussion, that the
optimum administration time (as here defined) for any given test is not
that in which the test will yield the maximum validity. In general, the
time in which a given achievement test will yield the maximum validity
(unless it is definitely a speed test) is probably that which allows all
pupils to finish comfortably. In general, it is believed that the curve
of validity for a test, when administered at different times, is a
rising curve, asymptotic to a horizontal line, of the type illustrated
in Figure 1. In situations, therefore, where a possible waste in time is
not a matter of serious consequence, such as in experimental situations,
and where the elaborate procedures necessary to determine the optimum
time for the measuring instruments used are not justifiable, the proper
procedure would be to play safe by allowing "plenty of time" on all
tests administered.


II. THE RELATIVE VALIDITY OF SIX
SELF-ADMINISTERING SPELLING TEST
TECHNIQUES


A large number of research studies have 
been made and reported which have attempted
to establish the relative validity and reli- 
ability of various self-administering objec- 
tive test techniques for measuring spelling 
ability. All of these studies, however, as 
well as the great majority of similar studies
in other fields, are rendered relatively in- 
conclusive by the failure of the investiga- 
tor to provide adequate control over one of 
the most important factors influencing test 
validity and reliability--the factor of ad- 
ministration time. The arguments supporting 
this contention have been presented in the 
preceding section of this article. It is the 
purpose of this section to report an experi- 
ment in which the application of the proce- 
dure described is illustrated in detail. Spe- 
cifically, this experiment attempts to deter- 
mine which one of six self-administering ob- 
jective spelling test techniques will yield 
the most valid measure of spelling ability 
in a given period of testing time, when each 
technique is administered at its own optimum 
rate and when a list-dictation spelling test 
is used as the criterion of validity. 

No less than twenty different test tech- 
niques are being used at the present time 
to measure spelling ability. At least five 
of these require that the administrator pro- 
nounce words or dictate sentences, or both, 
while other techniques are commonly desig-
nated as "self-administering" tests because
they do not require a stimulus other than the
prepared test form. While it is generally
admitted that the dictation-recall types are
most valid, there are numerous testing situ-
ations in which it is difficult or impossi-
ble to use them. In tests that are to be
standardized for wide use the self-adminis-
tering feature is particularly desirable. For
this reason the present experiment has been
limited to six of the most representative
test techniques of the self-administering
type, selected because on an a priori basis
they appeared to give the greatest promise
of high validity.


DESCRIPTION OF THE TESTS USED IN THIS EXPERIMENT

Test 1 is the criterion test. It consists 
of three lists (A, B and C) of fifty words 
each. The same fifty words contained in List 
A are also used in Tests 2 and 5; the fifty 
words in List B are also used in Tests 3 and 
6; and the fifty words in List C are also 
used in Tests 4 and 7. The method of pre- 
senting the words in the criterion test was 
the word-used-in-sentence list-dictation 
spelling test technique, in which the teach- 
er pronounces the word, then uses the word 
in a sentence, and then pronounces the word 
again. The pupils write only the word, as 
in an ordinary list-dictation test. The words 
were presented at the rate of one in every 
ten seconds.

The word-used-in-sentence list-dictation 
technique was selected for the criterion test 
of this experiment because it is strictly a 
recall test and because the use of each word 
in a sentence by the administrator obviates
to a large extent the possibility of the pu- 
pils' misunderstanding the pronunciation of 
the word. It is the technique most frequent- 
ly recommended for use in spelling scales and 
is generally considered one of the most val-
id of all spelling test forms.

The findings of preliminary experimenta-
tion by the authors indicate that a test con-
structed of words of approximately fifty per
cent standard accuracy¹ yields measures of
higher validity and reliability than tests
constructed of words at other levels or with
different distributions of difficulty.² All
of the words used in the tests of this ex-
periment were therefore of as nearly fifty
per cent standard accuracy as could be ob-
tained, as determined by Ashbaugh by the pro-
cedure used in deriving the Iowa Spelling
Scales. The accuracies ranged from fifty-
eight per cent to forty-five per cent, with
a mean accuracy of fifty-two per cent. The
words in each list were matched with words of
identical difficulty in the other lists. The
words used in the test were further limited
to those found in Horn's A Basic Writing Vo-
cabulary. The reliability of this criterion
test, for the groups used in the experiment,
was .98 ± .002.

Test 2 is a right-wrong spelling test, in
which the pupils are presented with a pre-
pared list (List A) of fifty words, approxi-
mately half of which are misspelled. The fol-
lowing sample will provide a sufficient in-
dication of the nature of this test:

   Directions: If the spelling of a word is
        RIGHT, put a circle around the R in
        front of it; if it is WRONG, encir-
        cle the W, as in the samples below.

        1.   R  (W)   aranged
        2.  (R)  W    article
        3.   R  (W)   asuring

This technique is used in the Purdue Eng-
lish Test.

Test 3 is a recognition two-response spell-
ing test of fifty items (List B), the nature
of which is indicated by the following sam-
ple.

   Directions: In the following spelling
        list, each word is spelled in two
        ways. You are to select the correct
        spelling of each word and put its
        NUMBER (not the word itself) in the
        parentheses at the right, as in the
        samples below.

        1. (1) forty     (2) fourty  --------- (1)
        2. (1) allways   (2) always  --------- (2)
        3. (1) fought    (2) faught  --------- (1)

This technique is used in the Cross English
Test. The misspelled forms used in this test
are those determined by Masters³ to be the
most frequently occurring misspelled forms
for ninth-grade pupils.

1. [Opening illegible] ... [sam]ple of pupils (in this case, of eighth-grade pupils) [remainder illegible].
2. A similar and more conclusive investigation of this particular problem has been reported by Thelma G. Thurstone, "The Difficulty of a Test and Its [illegible]," [journal title illegible] (May, 1932).
3. Harvey Victor Masters, A Study of Spelling Errors (University of Iowa, Iowa City, Iowa, 1927).

Test 4 is a recognition four-response test,
in which the pupils are confronted with four
spellings of each of fifty words (List C):
the correct spelling and the three most fre-
quent forms of misspelling, as in the fol-
lowing sample:


   Directions: Each of the following words
        is spelled in four ways. You are to
        select the correct spelling of each
        word and put its NUMBER (not the
        word itself) in the parentheses at
        the right, as in the samples below.

        1. (1) fourty  (2) forty   (3) fortey   (4) fourtey ---- (2)
        2. (1) always  (2) alway   (3) allways  (4) alaway  ---- (1)
        3. (1) fot     (2) fought  (3) foght    (4) faut    ---- (2)


This technique is used in the Columbia Re- 
search Bureau English Test. Again the most 
frequently occurring misspelled forms are 
taken from Masters' Study of Spelling Er- 
rors. 

Test 5 is a list proof reading recall test 
of the type illustrated below. List A from 
the criterion test was used in this test. 








   Directions: Some of the words in the fol-
        lowing list are spelled incorrectly.
        If a word is spelled correctly, place
        a C on the line opposite it. If it
        is spelled incorrectly, write the
        correct spelling of the word on the
        line opposite it, as in the sample
        below.

        1. aranged     arranged
        2. article     C
        3. asuring     assuring





This technique is used in the Iowa Placement 
Examinations. Again the misspelled words 
are presented in the most frequently occur-
ring misspelled form.

Test 6 is a word-in-sentence recall test 
of the type illustrated below. List B from 
the criterion test was used in this test. 





   Directions: The underlined word in each
        of the following sentences is mis-
        spelled. Write the correct spelling
        in the blank below the sentence.

        1. It was an ekskwizit piece of lace.
           ____________

        2. I am greatfull for your assistance.
           ____________

        3. The accident was unfortunitate.
           ____________





An attempt is made to misspell the words
in this form quite badly and yet make it
possible for the pupil to recognize what
word was intended. The purpose in this de-
vice is to avoid as much as possible present-
ing the student with the correct spelling of
part of the word.

Test 7 is a sentence proof reading recall 
test of the type illustrated in the follow- 
ing sample. List C from the criterion test 
was used in this test. 





   Directions: Many of the following sen-
        tences, but not all of them, contain
        misspelled words. You are to find
        these misspelled words, underline
        them, and write the correct spelling
        in the space to the right of each
        sentence. If all of the words in a
        sentence are spelled correctly, place
        a C in the space. The first items
        have been done correctly.

        1. Place the picture on the bulliten board.    bulletin
        2. Ordinarily I work ten hours per day.        C
        3. I have no appology to make.                 apology











EXPERIMENTAL PROCEDURE 

The experiment was conducted at the eighth-
grade level with two groups of approximately
400 pupils each, selected from sixteen high
school systems. Complete sets of test scores
were secured for 375 pupils in Group 1 and
360 pupils in Group 2. Group 1 took Tests 1,
2, 3 and 4, and Group 2 took Tests 1, 5, 6,
and 7. As has already been described, Tests
2, 3, and 4 were equated with respect to the
difficulty of the words used, as were also
Tests 5, 6 and 7. The procedure followed
therefore obviated any serious practice ef-
fect, since the same words were not pre-
sented more than twice to the same group of
pupils. Each of the tests except the cri-
terion test was administered by means of the
colored pencil technique described in the
first article of this series. The experi-
mental time limits used in the administra-
tion of these tests were determined by means
of preliminary experiments with an inde-
pendent group of 140 eighth-grade pupils.
The first time limit set for each test was
that found necessary for approximately twen-
ty-five per cent of the pupils to finish it.
The second time limit was that required for
approximately fifty per cent to complete the
test; the third that required for approxi-
mately seventy-five per cent, and the fourth
that required for all pupils to complete the
test. These time limits were chosen since
it seemed likely that their range would in-
clude the optimum time--other time limits
within approximately the same range would
have served as well.
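
The choice of the four experimental time limits described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the rule of taking the shortest time by which at least the stated fraction had finished, and the finishing times themselves are hypothetical, and the article says only that the limits were "approximately" those at which 25, 50, 75, and 100 per cent of the preliminary group finished.

```python
import math

def time_limits_from_finish_times(finish_minutes, fractions=(0.25, 0.50, 0.75, 1.00)):
    """For each fraction, return the shortest time (in minutes) by which
    at least that fraction of the pupils had finished the test."""
    ordered = sorted(finish_minutes)
    n = len(ordered)
    limits = []
    for fraction in fractions:
        # Number of pupils who must have finished by this limit.
        needed = max(1, math.ceil(fraction * n))
        limits.append(ordered[needed - 1])
    return limits

# Hypothetical finishing times for a small preliminary group (not the article's data).
sample_times = [3, 4, 4, 5, 5, 5, 6, 7]
print(time_limits_from_finish_times(sample_times))
```

With real data the four limits would then be applied to the main experimental groups, and the per cent finishing at each limit recorded as in Table I.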


ANALYSIS OF RESULTS 

The statistical measures computed for each
test at each administration time are shown
in Table I. Tests indicated as 2c and 3c are
the same as Test 2 and Test 3 but the scores
have been corrected for guessing. The coef-
ficients of reliability for each time limit
for each test were predicted by means of the
Spearman-Brown Prophecy Formula from the cor-
relation between scores earned in that time
on random halves of the test. The coeffi-
cients of validity for each test were se-
cured by computing the correlation between
scores obtained at each time limit and the
scores on the criterion test. The data on
reliability and validity for each test are
also shown graphically in Figures 2 to 9 in-
clusive.

                                   TABLE I
          THE STATISTICAL MEASURES FOR EACH TIME LIMIT FOR EACH TEST

               Time in  Per Cent                    Standard
Test           Minutes  Finished  Mean              Deviation        Reliability  Validity

1  Grp. I        25       100     73.99 ± 1.317     37.85 ± .931     .98 ± .002
   Grp. II       25       100     78.41 ± 1.224     34.47 ± .865     .97 ± .002

2  Not Corr.      3        30     25.62 ± .318       9.15 ± .225     .91 ± .006   .58 ± .023
                  4        60     29.32 ± .286       8.21 ± .202     .87 ± .008   .69 ± .018
                  5        80     30.93 ± .253       7.28 ± .179     .84 ± .010   .74 ± .016
                  8       100     31.37 ± .242       6.95 ± .171     .82 ± .012   .74 ± .016

2c Corr.          3        30     12.70 ± .409      11.75 ± .289     .82 ± .012   .73 ± .016
                  4        60     14.10 ± .440      12.64 ± .311     .82 ± .012   .76 ± .015
                  5        80     14.78 ± .444      12.76 ± .314     .82 ± .012   .76 ± .015
                  8       100     14.89 ± .444      12.76 ± .314     .82 ± .012   .76 ± .015

3  Not Corr.      4        36     32.92 ± .328       9.43 ± .232     .92 ± .005   .64 ± .021
                  5        66     35.61 ± .290       8.33 ± .205     .89 ± .007   .69 ± .018
                  6        86     36.709 ± .267      7.68 ± .189     .87 ± .008   .71 ± .017
                  8       100     37.14 ± .258       7.42 ± .182     .86 ± .009   .72 ± .017

3c Corr.          4        36     24.14 ± .439      12.62 ± .310     .84 ± .010   .73 ± .016
                  5        66     25.33 ± .443      12.72 ± .313     .83 ± .011   .76 ± .015
                  6        86     25.69 ± .449      12.91 ± .317     .84 ± .010   .77 ± .014
                  8       100     25.92 ± .448      12.88 ± .317     .83 ± .011   .77 ± .014

4  Not Corr.      6        25     25.28 ± .375      10.77 ± .265     .93 ± .005   .79 ± .018
                  7        46     27.77 ± .374      10.75 ± .264     .93 ± .005   .73 ± .016
                  8        67     29.56 ± .362      10.41 ± .2[?]    .92 ± .006   .76 ± .015
                 10       100     30.85 ± .348      10.00 ± .246     .91 ± .006   .78 ± .014

5                 4        16     23.30 ± .295       8.31 ± .209     .92 ± .005   .62 ± .022
                 [?]      [?]     28.51 ± .270       7.61 ± .191     .90 ± .007   .74 ± .016
                  7        84     30.81 ± .245       6.91 ± .173     .87 ± .009   .81 ± .012
                  9        99     31.17 ± .237       6.68 ± .168     .85 ± .010   .83 ± .011

6                 8        16     17.19 ± .364      10.25 ± .257     .93 ± .005   .80 ± .013
                 10        51     19.52 ± .387      10.90 ± .274     .93 ± .005   .84 ± .010
                 12        77     20.44 ± .366      10.88 ± .273     .93 ± .005   .85 ± .010
                 14        96     20.76 ± .384      10.83 ± .272     .93 ± .005   .86 ± .009

7                 8        24     19.20 ± .311       8.75 ± .220     .90 ± .007   .78 ± .014
                 [?]       49     21.53 ± .323       9.10 ± .228     .90 ± .007   .81 ± .012
                 11        73     22.81 ± .326       9.17 ± .231     .91 ± .006   .84 ± .011
                 14        99     23.51 ± .327       9.21 ± .231     .91 ± .006   .85 ± .010
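
The split-half reliability estimate described in this section can be sketched as follows. The sketch assumes the usual form of the Spearman-Brown formula, r_full = 2 r_half / (1 + r_half); the function names and the half-test scores are hypothetical and are not taken from the article.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary in both halves")
    return sxy / math.sqrt(sxx * syy)

def spearman_brown(r_half, factor=2.0):
    """Predicted reliability of a test `factor` times as long as the part
    whose reliability (or half-test correlation) is r_half."""
    return factor * r_half / (1.0 + (factor - 1.0) * r_half)

# Hypothetical scores of five pupils on two random halves of one test.
half_a = [12, 15, 9, 20, 17]
half_b = [11, 16, 10, 19, 15]
r_half = pearson_r(half_a, half_b)
print(round(spearman_brown(r_half), 3))
```

The same `spearman_brown` function with a `factor` other than 2 gives the predicted reliability of a test lengthened or shortened by that factor.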

It is immediately clear from these data
that both the reliability and the validity
of tests of this type are to a very signifi-
cant degree a function of the time in which
they are administered. In general, increas-
es in time from the shortest to the longest
periods result in decreases in reliability
and increases in validity. In general, al-
so, the most significant effects are noted
at the shorter time limits. At the longer
time limits, the curves of both reliability
and validity tend to flatten out, i.e., re-
liability and validity tend to become con-
stant when the time limit exceeds a certain
length. These facts establish conclusively
the necessity of controlling the time factor
in comparisons between test techniques.

[Figures: Illustrating the Obtained and Predicted Validities and Reliabilities for Tests 2 and 3 for Various Amounts of Testing Time. Horizontal axis: minutes of testing time. Legible panel titles: Right-Wrong, Not Corrected for Guessing, Test 2; Two-Response Recognition, Corrected for Guessing.]

[Figures: Illustrating the Obtained and Predicted Validities and Reliabilities for Tests 4, 5, 6 and 7 for Various Amounts of Testing Time. Horizontal axis: minutes of testing time. Legible panel titles: Proof Reading, List Recall, Test 5; Word-in-Sentence Recall, Test 6; Fig. 9, Proof Reading Recall, Test 7.]

The way in which Figures 2 to 9 may be em-
ployed in determining the optimum administra-
tion time for each test has been fully ex-
plained in the preceding section. Columns 1,
2, 3, and 4 of Table II summarize the con-
clusions which may be drawn from these dia-
grams. Columns 4 and 5 of this table show
the reliability and validity coefficients of
a 50 word test of each type when administer-
ed at its own optimum rate. These coeffi-
cients, however, cannot be directly compared
from test to test, since the optimum times
for the different types are not the same.
Such comparisons can be fairly made only be-
tween tests that require the same time in ad-
ministration. For this reason a standard
time limit of 12 minutes was selected, in
terms of which comparisons could be made.





TABLE II

SHOWING OPTIMUM TIME AND VALIDITY FOR THE SEVERAL TESTS

[Legible column headings (order uncertain): Test; Optimum time for 50 words (minutes); Optimum rate (words per minute); Reliability of 50 word test at optimum rate; Validity of 50 word test at optimum rate; Number of words in optimum 12-minute test; Validity of optimum 5-minute test; Validity of optimum 12-minute test; Validity of optimum 20-minute test. The body of the table is not legible in this copy.]








The number of words necessary to constitute 
a 12-minute test when administered at the 
optimum rate was then determined for each 
type of test. Then again by means of the 
formula provided by Kelley, a prediction was 
made of the validity of this augmented test 
from the data given in columns 1, 3, and 4.
These predicted validities, given in column 
7 of Table II, apply therefore in each case 
to a 12-minute test in which the words are 
administered at the optimum rate for the 
type of test in question. 
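
The prediction step just described can be illustrated with a short sketch. The article cites "the formula provided by Kelley" without printing it; the sketch below assumes the standard formula for the validity of a test lengthened n times, r' = r_v * sqrt(n) / sqrt(1 + (n - 1) r_r), where r_v is the validity and r_r the reliability of the original test. The function names and the example figures are hypothetical and are not entries from Table II.

```python
import math

def words_in_standard_test(optimum_rate_wpm, standard_minutes):
    """Number of words administered at the optimum rate in the standard time."""
    return optimum_rate_wpm * standard_minutes

def predicted_validity(validity, reliability, length_factor):
    """Predicted validity of a test `length_factor` times as long as the original
    (assumed standard lengthening formula; see the lead-in above)."""
    return (validity * math.sqrt(length_factor)
            / math.sqrt(1.0 + (length_factor - 1.0) * reliability))

# Hypothetical case: a 50 word test whose optimum time is 5 minutes (10 words per minute).
words = words_in_standard_test(10, 12)   # length of the 12-minute test in words
factor = words / 50                      # how many times longer than the 50 word test
print(words, round(predicted_validity(0.74, 0.84, factor), 3))
```

Under this formula, of two tests with the same validity the one with the lower reliability gains more from lengthening, which agrees with the remark made later in this section about the returns from lengthening a test.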

We are now ready, in terms of a standard 
time limit of 12 minutes, to rank the vari- 
ous types of test in order of their validity. 
The three highest are those of the types in- 
volving recall, viz., the word-in-sentence, 
the list proof reading and the sentence proof 
reading types, of which the validities are
.85, .85, and .84 respectively. Those of
intermediate validity are the recognition 
types in which correction is made for guess- 
ing, viz., the right-wrong and two-response 
types, of which the validities are .81 and 
.80 respectively. The three lowest are the 
recognition types when uncorrected for guess- 
ing--the four-response, the right-wrong, and 
the two-response types--of which the validi- 
ties are .79, .78, and .73 respectively. 

It should be recognized that these valid-
ity coefficients are all predicted by means
of theoretical formulas from data that are
themselves subject to errors of sampling, of
measurement, and of approximation (in setting
optimum times). The comparisons made should
therefore be accepted with caution. Since
no formulas are available for computing the
probable error of sampling and of estimate
in predicted coefficients of this type, no
conclusive comparisons are possible. It ap-
pears likely, however, considering the size
of the groups used and the care exercised in
controlling conditions, that the recall types
are significantly superior in validity to the
types involving recognition only. It also
appears probable that corrections for guess-
ing are desirable, particularly for the right-
wrong type of recognition test in spelling.

It should be noted also that these compar-
isons may differ when based upon standard
time limits of other magnitudes. The 12-min-
ute standard was chosen in this case because
it happened to be the longest of the optimum
times of the tests investigated. Other com-
parisons have been made, in columns 6 and 8
of Table II, for standard times of 5 and 30
minutes respectively. It will be noted that
differences in the relative magnitude of the
validity coefficients obtain for these addi-
tional time limits, particularly for the 5-
minute interval. In general, for two tests
of the same validity, the one whose reliabil-
ity is the lower is that for which greater
returns in increased validity may be secured
through lengthening the test. This accounts
for the fact that certain of these types show
greater increases in validity with increases
in the standard time than others. It would
also appear, from these data, that a test of
much more than 12 minutes in length is not
justifiable for many of these types of spell-
ing tests. In that amount of time, or less,
the practical limit of validity is closely
approached.


SUMMARY


While this article presents important ev- 
idence concerning the relative effectiveness 
of certain forms of self-administering spell- 
ing tests, its major purpose has been to 
demonstrate the application of a new experi- 
mental procedure for controlling the time factor 
in test evaluation and to raise certain the- 
oretical issues for the test experimenter. 
The major conclusions and implications drawn 
from the findings are summarized in the fol- 
lowing statements, certain of which should 
be accepted with caution or considered as 
merely tentative, due to the possibility of 
experimental errors of sampling, of measure- 
ment, and of prediction. 

The reliability and validity of objective 
tests of the types investigated have been 
shown to depend to a very significant degree 
upon the time in which they are administered, 
thus establishing the necessity of control-
ling this factor in comparisons of the effec-
tiveness of test techniques.

The nature of the relationships between 
reliability and validity and the length of 
the administration period has been determin- 
ed for certain self-administering spelling 
tests. For these tests, increases in the 
administration time result in increases in 
validity and in decreases in reliability, un- 
til a certain length of period is reached 
beyond which both validity and reliability 
tend to remain constant. 

The optimum rate of administration for 
each of certain self-administering spelling 
tests has been tentatively determined, and 
an experimental procedure for determining 
this optimum rate has been demonstrated. The 
types involving recall show the slowest op- 
timum rate (words per minute). The right- 
wrong type when correction is to be made for 
guessing should be administered with a short- 
er time than when no such correction is to 
be made.

Six types of self-administering spelling
tests have been ranked in order of their va-
lidity for a standard administration period
when each type is administered at its own
optimum rate. Those involving recognition
only are least valid, while those involving
recognition plus recall yield the highest
validities.

Corrections for guessing have been shown
to increase the validity of certain recogni- 
tion types of spelling tests, particularly 
the right-wrong type. 

The practical limit in validity for self- 
administering spelling tests of the types in- 
vestigated is closely approached in tests of 
12 minutes or less in length. 





III. INDICES OF DISCRIMINATION FOR 
SPELLING TEST TECHNIQUES 


In the selection of items, after try-out, 
for a test intended to measure a given abil- 
ity, one is concerned particularly with two 
attributes of each item: its difficulty and 
its validity. Quantitative measures of dif- 
ficulty, or frequency of error, may be eas- 
ily secured for individual test items, but 
the problem of securing a quantitative meas- 
ure of the validity of a single test item has 
not been satisfactorily solved. A practica- 
ble measure of this type would be extremely 
valuable and would probably make possible a 
marked improvement in the validity of many 
types of standardized achievement examina- 
tions. 

A number of suggestions have been made for
measuring the "goodness" of a single test i-
tem. Patterson¹ has reported a method for
evaluating items in a psychological test as
follows: Students in a class are divided in-
to five groups, A, B, C, D, and E, on the
basis of the final grade of the course. The
test papers for the year are examined for
each group and the percentage of pupils in
each group answering each question correctly
is determined. The evaluation of a given i-
tem is then based on the extent to which the
percentage of correctness increases from
group to group.

Vincent² has used the following procedure
in evaluating intelligence test items: "Find
the median total test score of the people who
passed the element; find how many of the peo-
ple who failed the element have total test
scores which reach or exceed that median;
find what per cent that number is of the
total number of people who failed the el-
ement."

1. Donald G. Patterson, [title illegible] (Yonkers-on-Hudson: [illegible]), p. [illegible].
2. Leona Vincent, A S... [title illegible], Contributions to Education, No. 152 (New York City: Bureau of Publications, Teachers College, Columbia University, 1924), p. 9.
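
Vincent's procedure, as quoted above, translates almost directly into code. The sketch below is an illustration only; the function name, argument layout, and the scores are hypothetical. A low percentage indicates that few of those failing the element reach the typical total score of those passing it.

```python
from statistics import median

def vincent_overlap_percent(total_scores, passed):
    """Per cent of those failing the element whose total test score reaches or
    exceeds the median total score of those passing it."""
    pass_scores = [s for s, ok in zip(total_scores, passed) if ok]
    fail_scores = [s for s, ok in zip(total_scores, passed) if not ok]
    if not pass_scores or not fail_scores:
        raise ValueError("the element must be passed by some and failed by others")
    pass_median = median(pass_scores)
    reaching = sum(1 for s in fail_scores if s >= pass_median)
    return 100.0 * reaching / len(fail_scores)

# Hypothetical total scores and pass/fail records on one element.
totals = [10, 20, 30, 40, 50, 60]
print(vincent_overlap_percent(totals, [False, False, True, False, True, True]))
```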

Symonds¹ has recommended the following
procedure as a rough and ready method of de- 
termining the "goodness" of test items: At 
the end of the term select the test papers 
of the best 20 pupils in the class and the 
poorest 20 pupils in the class. Tabulate
the pupils’ correct response on each item 
of the test. The measure of the "goodness" 
of the item is the difference between the 
number of correct responses for the good 
group and the poor group. 
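
The rough-and-ready method attributed to Symonds above amounts to a single subtraction. A minimal sketch, assuming each pupil's record is a pair of (total score, whether the item was answered correctly); the function name and the data are hypothetical.

```python
def upper_lower_difference(records, group_size=20):
    """Correct responses among the best `group_size` pupils minus correct
    responses among the poorest `group_size` pupils, ranked by total score."""
    if len(records) < 2 * group_size:
        raise ValueError("not enough pupils for two non-overlapping groups")
    ranked = sorted(records, key=lambda record: record[0], reverse=True)
    best = ranked[:group_size]
    poorest = ranked[-group_size:]
    correct_best = sum(1 for _, correct in best if correct)
    correct_poorest = sum(1 for _, correct in poorest if correct)
    return correct_best - correct_poorest

# Hypothetical class of six pupils, using groups of two instead of twenty.
pupils = [(90, True), (80, True), (70, False), (60, True), (50, False), (40, False)]
print(upper_lower_difference(pupils, group_size=2))
```

Because the result is a raw count, its largest possible value depends on the group size, so differences computed from groups of different sizes are not directly comparable.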

Other workers² in the field of examination
construction have suggested other quantita-
tive measures of goodness, perhaps the best
of which is bi-serial r between success or
non-success on the item and a criterion meas-
ure. Most of these devices, however, have
only been suggested as approximate or rule-
of-thumb procedures for use in informal test
construction. Very few attempts have been
made to use such techniques on a large scale
in the construction of standardized achieve-
ment tests. Moreover, very little experi-
mental examination has yet been reported of
the relative effectiveness of these various
techniques, nor has much careful considera-
tion been given to their characteristics, fa-
vorable and unfavorable. It has as yet been
left largely to opinion which one, if any,
of these procedures should be employed.
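
The bi-serial r mentioned above can be sketched as follows. The article does not print the formula; the sketch assumes the classical form r_bis = ((M_p - M_q) / sigma) * (p q / y), where M_p and M_q are the mean criterion scores of those passing and failing the item, sigma is the standard deviation of all criterion scores, p and q are the proportions passing and failing, and y is the ordinate of the unit normal curve at the point dividing p from q. The function name and data are hypothetical.

```python
from statistics import NormalDist, mean, pstdev

def biserial_r(criterion_scores, passed):
    """Bi-serial correlation between success on an item and a criterion measure
    (classical formula, assumed here; see the lead-in above)."""
    pass_scores = [s for s, ok in zip(criterion_scores, passed) if ok]
    fail_scores = [s for s, ok in zip(criterion_scores, passed) if not ok]
    if not pass_scores or not fail_scores:
        raise ValueError("the item must be passed by some pupils and failed by others")
    sigma = pstdev(criterion_scores)
    if sigma == 0:
        raise ValueError("criterion scores must vary")
    p = len(pass_scores) / len(criterion_scores)
    q = 1.0 - p
    unit_normal = NormalDist()
    # Height of the normal curve at the point cutting off proportion p.
    ordinate = unit_normal.pdf(unit_normal.inv_cdf(p))
    return (mean(pass_scores) - mean(fail_scores)) / sigma * (p * q / ordinate)

# Hypothetical criterion scores; the item is passed only by the upper half.
scores = [1, 2, 3, 4, 5, 6, 7, 8]
print(biserial_r(scores, [False] * 4 + [True] * 4))
```

Since the formula rests on an assumption of normally distributed ability, the coefficient can exceed unity for small or markedly non-normal groups, as it does for this artificial example.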

In the present study an attempt has been
made to develop certain new procedures for
computing an "index of discrimination" for
test items, to make an empirical investiga-
tion of the characteristics of these and
other procedures of the same kind, and on
the basis of this comparison to make recom-
mendations concerning their use. The appli-
cation of the techniques described may in
some instances be practicable only where
electric tabulating equipment is available,
and they may therefore be of intrinsic in-
terest only to a limited group of test tech-
nicians. It is believed, however, that be-
cause of the discussion of the objective
characteristics of a desirable test item,
and because of the issues in test construc-
tion which it raises, this article may never-
theless be of general interest to users as
well as to builders of test materials.


The discriminating power of a single test 
item refers to the degree to which success on 
that item by itself indicates possession of 
the general ability which is being measured.
It may be defined as the accuracy with which 
a pupil can be placed with reference to a 
point on the scale of ability on the basis 
of success or failure on the given item. An 
item may be said to be perfect in discrimin- 
ating power when every pupil who scores suc- 
cessfully on the item ranks higher on the 
scale than any pupil who fails the item. An 
item may be said to have zero discriminating 
power when there is no systematic difference 
between the ability of the pupils who succeed 
on the item and those who fail. 

Various hypothetical degrees of discrimin- 
ating power for a spelling test item of 50 
per cent difficulty are presented graphically 
in Figure 10. This figure shows the various 
types of relationship which may be found be- 
tween general spelling ability, as measured 
by an adequate criterion test, and the abil- 
ity to spell a given single word (in this 
case a word spelled correctly by 50 per cent 
of the pupils in a hypothetical experimental 
group). The vertical scale in this figure 
indicates the per cent of pupils in each dec- 
ile of general ability who spelled the word 
correctly. The placement of pupils on the 
(horizontal) ability scale is determined on 
the basis of their percentile standing on 
the criterion test. The "line of discrimin-
ation" for a given word indicates the in-
creasing accuracy with which pupils at suc-
cessive levels of ability spell the word.
It should be noted that the horizontal abil- 
ity scale, being expressed in terms of per- 
centiles, is not a true scale. The value of 
the percentile "unit" increases in magnitude 
(as measured in raw score units) as either 
end of the scale is approached. This fact 
partially accounts for the characteristic 
shapes of these "lines of discrimination." 
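
The construction of a "line of discrimination" as described above can be sketched as follows: pupils are ranked on the criterion test, divided into ten equal groups, and the per cent spelling the word correctly is found for each group. The function name and the data are hypothetical; ties on the criterion are left in their original order, which is an arbitrary choice made only for this illustration.

```python
def line_of_discrimination(criterion_scores, item_correct, groups=10):
    """Per cent of pupils spelling the word correctly in each successive
    ability group, from the lowest group to the highest."""
    n = len(criterion_scores)
    if n < groups:
        raise ValueError("need at least one pupil per group")
    # Indices of pupils ordered from lowest to highest criterion score.
    order = sorted(range(n), key=lambda i: criterion_scores[i])
    percents = []
    for g in range(groups):
        members = order[g * n // groups:(g + 1) * n // groups]
        correct = sum(1 for i in members if item_correct[i])
        percents.append(100.0 * correct / len(members))
    return percents

# Hypothetical word of 50 per cent difficulty with perfect discriminating power:
# every pupil in the upper half spells it correctly, no pupil in the lower half does.
scores = list(range(20))
correct = [score >= 10 for score in scores]
print(line_of_discrimination(scores, correct))
```

A word with zero discriminating power would give roughly the same percentage in every group, and one with minus discriminating power a falling series.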

1. Percival M. Symonds, [title partly illegible] ... Education (New York: The Macmillan Company, 1927), p. 585.
2. Since this article was [written], "An Evaluation of Methods of Evaluating Test Items," by Theo. F. Lentz, Jr., Bertha Hirshstein, and J. H. Finch has been published in the May 1932 number of the Journal of Educational Psychology. The method of approach to the problem, however, and the specific techniques evaluated differ from those of the present article.

Line MM represents the line of discrimin-
ation for an item of 50 per cent difficulty
which shows perfect discriminating power,
since every pupil below the 50th percentile
of ability misspelled the word, and every
pupil above the 50th percentile spelled it
correctly. The pupil who spells such a word
correctly may then be accurately placed on
the ability scale with reference to one
point, in this case the point of median a-
bility. Only a dichotomous classification,
of course, is possible; the pupil's distance
above or below the median point cannot be
determined.

Line UU represents the line of discrimi-
nation for an item of 50 per cent difficulty
which has zero discriminating power, since
the same per cent of pupils at every ability
level spelled the word correctly. When a
pupil misspells such a word there is no
greater reason for placing him on the lower
part of the ability scale than the upper;
i.e., he is no more likely to be a poor than
a good speller in general.

Line VV represents the line of discrimi- 
nation for an item of 50 per cent difficulty 
which has minus discriminating power, since 
it is spelled more frequently by pupils of 
low spelling ability than by pupils of high 
spelling ability. A pupil who spells such a
word correctly is more likely to be a poor 
speller than a good one. Several words of 
this type were discovered in this investiga-
tion.

Between the extremes of perfect and minus 
discriminating powers, items of all degrees 
of discrimination may be found. These are 
illustrated for items of 50 per cent diffi-
culty in Figure 10 by lines NN, OO, PP, QQ,
etc. 

There is no apparent reason for assuming 
any relationship between the discriminating 
power of a test item and its difficulty or 
its percentage of incorrect responses. An 
item of any difficulty may have any degree 
of discriminating power. In Figure 11, hy- 
pothetical lines of discrimination are il- 
lustrated for good and poor items at three 
levels of difficulty. Line N'N' represents 
an item on which 80 per cent of the respons- 
es were correct with very high discriminat- 
ing power, and line V'V' represents an item 
of the same difficulty but with very low dis-
criminating power. The lines N"N" and V"V"
represent items on which 50 per cent of the
responses were correct with high and low
discriminating power respectively. Lines
N'N', N"N", and N'''N''' represent items of
marked differences in difficulty but with
the same degrees of discriminating power.

An ideal test might consist of items of high discriminating power such as are represented by the lines N'N', N"N", N'''N''', etc., distributed evenly throughout the difficulty scale, or it might consist of items of medium discriminating power, as represented by the line RR in Figure 10, all of which are, say, between 20 per cent and 80 per cent difficulty. The latter ideal, however, is probably more practicable, since in actual practice it is extremely difficult to find items with the very high degrees of discriminating power such as are represented by the very nearly vertical lines in Figure 11.








Items of Various Difficulties May Be Either High or
Low in Discriminating Power


In actual test construction on any large scale it is not expedient, of course, to construct curves of this kind for each individual item on the basis of preliminary experimentation. If discriminations are to be made between items of various degrees of goodness they must be made on the basis of a single quantitative measure, or "index of discrimination," that can be conveniently computed and readily tabulated and compared for a large number of items.

Several desirable characteristics of such a measure can be immediately suggested on an a priori basis. It must be computed on the basis of successes and failures on the given item and of measures secured from an adequate criterion test. The method of computation should preferably be simple and reasonably rapid. It is desirable that the index for an item of perfect discriminating power be unity, and for an item of zero discriminating power be zero. There should be no relationship between the numerical value of the index of discrimination and the difficulty of the item. An adequate procedure for computing an index of discrimination with these characteristics would not only facilitate the selection of superior items in test construction, but would also make possible an analytic study of items of high and low discriminating power to discover the factors which contribute to their effectiveness. The selection of items which discriminate between pupils with various types of special deficiency for use in diagnostic tests is also a possibility in this field.

With these theoretical considerations and possible applications in mind, five specific indices of discrimination were selected for examination and evaluation. Certain of these were selected because they, or other measures with essentially the same characteristics, have been recommended and used by recognized workers in this field. Certain others were empirically developed by the authors in an attempt to improve on those previously suggested. The comparisons made will be based upon the data collected through the administration of a list-dictation spelling test of 200 words to 460 eighth grade pupils. On the basis of these data, each of the five selected indices of discrimination was computed for each word, and for each word also a detailed and reliable description of its discriminating qualities was secured by constructing its "line of discrimination."1

The 200 word test consisted of four lists of fifty words each, selected from Horn's "A Basic Writing Vocabulary." All words in list I were of 75 per cent standard accuracy at the eighth-grade level. List II words were of 50 per cent standard accuracy, and list III words of 25 per cent standard accuracy. The words in list IV ranged from 14 to 86 per cent standard accuracy with a mean of 50 per cent. This division enabled a study of the behavior of the indices for test items at different levels of difficulty. The scores on the entire test of 200 words were taken as the criterion for ranking the pupils in general spelling ability. The reliability of the criterion test for the group used was .987 ± .001, predicted from the correlation between scores on odd and even items.
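
The article does not say how the reliability of the whole test was "predicted" from the odd-even correlation. A common way of doing so, and presumably the one intended, is the Spearman-Brown formula, r_full = 2 r_half / (1 + r_half). The sketch below assumes that formula; the function names and the 0/1 response matrix are illustrative and are not taken from the study.

```python
from statistics import mean


def pearson_r(x, y):
    """Product-moment correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_brown(r_half):
    """Step a half-test correlation up to full test length (assumed formula)."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_matrix):
    """item_matrix[p][i] is 1 if pupil p spelled word i correctly, else 0."""
    odd = [sum(row[0::2]) for row in item_matrix]   # words 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_matrix]  # words 2, 4, 6, ...
    return spearman_brown(pearson_r(odd, even))
```

Under this formula a half-test correlation of .50 steps up to about .67 for the full test, which is why a long test such as the 200-word list can reach a very high predicted reliability.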

In terms of the data thus collected and 
derived, comparisons were made between the 
various indices on each of the following ba- 
ses: 

(1) The “correlation” of each index with 
the discriminating power of items as shown 
by their lines of discrimination. This was, 
of course, in each case an "inspectional cor- 
relation," and could not be reduced to quan- 
titative terms. By this method, an index was 
considered good if its value was relatively 
high for items whose lines of discrimination 
were sharply rising, and if its value was 
relatively low only for items whose lines of 
discrimination approached the horizontal.
Special attention was given to the lines of 
discrimination of those items which ranked 
high according to one index and low accord- 
ing to another. 





1. The "lines of discrimination" were constructed as follows: [...] and divided into ten groups of 46 each. The first group thus included [...] centile on the general ability scale, and each subsequent group the [...] on that scale. The per cent of correct spellings for each word was [...] The line of discrimination for a given word is then the line connec[...] cent accuracy with which pupils at each successive level of ability [...] thus constructed are shown for certain selected words in Figures 12 [...]

[...] is meant the per cent of pupils [...] the word correctly in a [...] spelling [...] accuracies were determined by E. J. Ashbaugh and are reported in [...]







(2) The product-moment coefficient of 
correlation, based on all words, between 
each index and the obtained difficulty of 
the words. If an index showed a high cor- 
relation with difficulty it was judged un- 
satisfactory, on the grounds that theoreti- 
cally there is no relationship between dif- 
ficulty and discriminating power. 

(3) The inter-correlations between the in- 
dices (secured for each list separately as 
well as for the total or combined list). 

(4) The simplicity and convenience of the
operations involved in the computation of
each index.

(5) The reliability of each index, i.e., 
its lack of susceptibility to sampling fluc- 
tuations. This latter characteristic could 
only be inferred from the nature of the in- 
dex, and was not objectively measured. 

The five indices with which this study is concerned will hereafter be identified by letter, and are briefly described in the following paragraphs. Because of lack of space, no description is included of the methods of computation which may be employed in practice to secure these indices. It should be noted, however, that where Hollerith tabulating equipment is available, short-cuts in large-scale computation may be employed whereby even the most complicated of these indices can be computed from punched cards at a minimum rate of fifty per hour and in some cases at more than three times that rate.

No attempt will be made, furthermore, to 
include in these descriptions any detailed 
justification of these indices on a logical 
basis. All of these indices except the last 
have been empirically derived and must be 
justified upon the basis of performance rath- 
er than upon the basis of logical considera- 
tions. The reader is requested, therefore, 
to avoid any a priori reasoning about these 
indices at this point, and to reserve his 
judgment until the evidence concerning their 
objective behavior has been presented. 

For the sake of convenience, each of these descriptions will be phrased in terms of the spelling test situation to which reference has already been made. In the computation of each of these indices, all the pupils in the experimental group were arranged in order of score made on the total test. Wherever any reference is made in the discussion, therefore, to the total experimental group, it is to be understood that the pupils or the test papers are arranged in this order. Accordingly, if any reference is made to a portion of the total group--as, for example, the upper one-fourth--it is to be understood that by this upper one-fourth is meant those pupils scoring above the seventy-fifth percentile score on the total test.

Index A: This index for any given word may be defined as the ratio between the number of correct spellings in the upper and the lower one-fourths of the experimental group. For example, in this investigation the word "fraternity" was spelled correctly by ninety-one per cent of the pupils in the upper one-fourth of the experimental group and by eight per cent of the pupils in the lower one-fourth. Index A for the word "fraternity" is then 91/8, or 11.3.
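
As a rough illustration of Index A, the following Python sketch finds the per cent of correct spellings in the upper and lower fourths of a group ranked on total score and takes their ratio. The function names and the data layout (one total score and one right/wrong flag per pupil) are assumptions made for the example, not the authors' procedure.

```python
def quarter_percentages(total_scores, item_correct):
    """Return (U, L): per cent correct in the upper and lower fourths."""
    # Rank pupils by total (criterion) score, lowest first.
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    k = len(order) // 4
    if k == 0:
        raise ValueError("need at least four pupils")
    lower, upper = order[:k], order[-k:]

    def pct(group):
        return 100.0 * sum(1 for i in group if item_correct[i]) / len(group)

    return pct(upper), pct(lower)


def index_a(u, l):
    """Ratio of upper-fourth to lower-fourth success; undefined when l is 0."""
    return u / l


print(index_a(91, 8))  # 11.375 from the rounded percentages quoted in the text
```

The rounded percentages give 11.375 rather than the 11.3 reported above; the difference presumably comes from rounding in the published figures. Because the lower-fourth percentage is the divisor, the ratio grows very large for hard words on which few weak spellers succeed.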

Index B: This index may be most conveniently defined in terms of the following algebraic equation: For any given word

$$\text{Index B} = \frac{U - L}{75}$$

in which U is the per cent of correct spellings of the word for the pupils in the upper one-fourth of the group, and L is the per cent of correct spellings for the pupils in the lower one-fourth. The denominator of this expression is taken as seventy-five because that happens to be the number of percentiles of ability between the median scores of the upper one-fourth and the lower one-fourth of the experimental group, and because it results in an index above unity only in rare instances. The maximum value that this index may have is 1.33.
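
The arithmetic of Index B is short enough to state directly. The sketch below is only an illustration with a hypothetical function name; U and L are the upper-fourth and lower-fourth percentages defined above.

```python
def index_b(u, l):
    """(U - L) / 75, with U and L in per cent of correct spellings."""
    return (u - l) / 75.0


# Every upper-fourth pupil right, every lower-fourth pupil wrong:
print(round(index_b(100, 0), 2))  # 1.33, the stated maximum
# Same success rate in both fourths, i.e. no discrimination:
print(index_b(50, 50))            # 0.0
```

The maximum of 1.33 follows from the largest possible difference, 100 per cent, divided by 75.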

Index C: Index C may perhaps best be explained by describing its computation in a particular case. The word "adapted" was misspelled by 128 of the 460 pupils in the experimental group. This word could therefore be perfect in discriminating power only in the event that all of the 128 pupils misspelling it ranked lower on the test scale than any who spelled it correctly. In other words, the word would be perfectly discriminating only if every one of the 128 lowest scoring pupils misspelled it. As a matter of fact, however, only 63 of the 128 lowest scoring pupils misspelled the word. The ratio between 63 and 128, or .492, may apparently therefore be taken as a measure of the discriminating power of the word. Index C might, therefore, be defined for any given word as the per cent of misspellings in that group which consists of all those pupils who would have misspelled it had the word been perfect in discriminating power but of the same difficulty. It may be algebraically defined as follows:

$$\text{Index C} = \frac{S}{R}$$

in which R equals the number of pupils in the total group misspelling the item (R is, therefore, also equal to the number of lowest scoring pupils who would have misspelled it had it been perfect in discriminating power), and S equals the number of pupils misspelling the word among the lowest R pupils in the experimental group. This index has the desirable characteristic that its value is unity for a perfectly discriminating item, but it also has the undesirable characteristics that there is no definite value to correspond to an item of zero discriminating power and that its value remains positive for negatively discriminating items.
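
The counting behind Index C can be sketched as follows. This is a minimal illustration under the definitions just given; the function name and the list-based data layout are assumptions, and the example data are constructed only to reproduce the counts quoted for "adapted" (R = 128, S = 63, 460 pupils).

```python
def index_c(total_scores, item_correct):
    """S / R: share of the R lowest-scoring pupils who missed the item."""
    # Rank pupils by total (criterion) score, lowest first.
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    r = sum(1 for ok in item_correct if not ok)   # R: misses in the whole group
    if r == 0:
        raise ValueError("index is undefined when nobody misses the item")
    s = sum(1 for i in order[:r] if not item_correct[i])  # S: misses among lowest R
    return s / r


# Constructed data: 63 misses among the 128 lowest pupils, 65 misses higher up.
scores = list(range(460))
correct = [True] * 460
for i in list(range(63)) + list(range(128, 128 + 65)):
    correct[i] = False
print(round(index_c(scores, correct), 3))  # 0.492
```

Note that S can be zero only when every pupil who missed the word ranks above the lowest R, so the value never falls below zero, which is the weakness the text points out.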


Index D: The derivation of Index D is more purely empirical than is that of any of the preceding indices. For that reason, no attempt will be made to present a logical justification for it other than to point out that it avoids the difficulty just mentioned for Index C and results in a value of zero for an item with no discriminating power, a value of unity for a perfectly discriminating item, and a negative value for a negatively discriminating item. It may be conveniently defined only in terms of the following algebraic expression:

$$\text{Index D} = \frac{ST - R^2}{RT - R^2}$$

in which R and S have the same significance as before, and T represents the number of pupils in the total experimental group.
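
A brief sketch of Index D follows, using a hypothetical function name. Dividing numerator and denominator by RT shows that the expression equals (S/R - R/T) / (1 - R/T); that is, it rescales Index C so that the value expected from a non-discriminating item (S/R equal to R/T) becomes zero. This reading is offered as an observation on the formula, not as the authors' derivation.

```python
def index_d(s, r, t):
    """(S*T - R^2) / (R*T - R^2) for S, R, T as defined in the text."""
    if r == 0 or r == t:
        raise ValueError("undefined when nobody or everybody misses the item")
    return (s * t - r * r) / (r * t - r * r)


print(index_d(128, 128, 460))  # 1.0: every one of the lowest R pupils missed it
print(index_d(25, 100, 400))   # 0.0: S/R equals R/T, no discrimination
print(index_d(10, 100, 400))   # -0.2: misses are rarer among the weakest pupils
```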

Index E: This index for any given word 
is simply the bi-serial r between success or 
failure on a given item and total score on 
the criterion test. 
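
The article gives no formula for the bi-serial r. The sketch below assumes the usual textbook form, r_bis = ((M_p - M_q) / sigma) (pq / y), where M_p and M_q are the mean criterion scores of pupils passing and failing the item, sigma is the standard deviation of all criterion scores, p and q are the proportions passing and failing, and y is the ordinate of the unit normal curve at the point dividing p from q. The function name is hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def biserial_r(item_correct, total_scores):
    """Bi-serial correlation between a right/wrong item and total score."""
    passed = [s for ok, s in zip(item_correct, total_scores) if ok]
    failed = [s for ok, s in zip(item_correct, total_scores) if not ok]
    if not passed or not failed:
        raise ValueError("item must be passed by some pupils and failed by others")
    p = len(passed) / len(total_scores)
    q = 1.0 - p
    unit_normal = NormalDist()
    # Ordinate of the normal curve at the cut that separates p from q.
    y = unit_normal.pdf(unit_normal.inv_cdf(p))
    sigma = pstdev(total_scores)  # standard deviation of the whole group
    return (mean(passed) - mean(failed)) / sigma * (p * q / y)
```

The coefficient rests on the assumption that the ability underlying the right/wrong response is normally distributed; in small or markedly non-normal samples the computed value can fall somewhat outside the range from -1 to +1.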

As has already been indicated, each of these indices was computed for each of 200 words in a list-dictation spelling test which had been administered to 460 eighth-grade pupils. Space will not permit the presentation of the data for all of the words, but for illustrative purposes a random sampling of one-tenth of these words has been selected and presented in Table III, together with the difficulty measures and the values of each index for each word. Table IV shows the average difficulty and the average value of each index for the words in each of the four lists constituting the total test. It will be remembered that list I consists exclusively of easy words (of seventy-five per cent standard accuracy), list II of words of intermediate difficulty (fifty per cent standard accuracy), list III of difficult words (twenty-five per cent standard accuracy), and list IV of words uniformly distributed over the accuracy scale (from fourteen to eighty-six per cent standard accuracy). Column 1 of Table IV shows the obtained average per cent of correct spellings of the words in each list for the experimental group. The lower row in the table presents the corresponding measures for the total or combined list.

Table V presents the standard deviations, 
for each list separately and for the total 
combined list, of each of the measures whose 
means are given in Table IV. 

Table VI presents all of the inter-correlations (product moment coefficients) between the six measures already described (the measure of difficulty and the five indices of discrimination) for each of the four lists separately and for the total combined list.

Figures 12-15 present graphically the 
discriminating power of selected items which 
rank high or low according to different in- 
dices. The method employed in constructing 
these lines of discrimination has already 
been described. (See Figures 12-15.)
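
For readers who wish to reproduce such lines, the construction can be sketched in Python. This is a minimal illustration, not the authors' tabulating procedure: it assumes, as the footnote to this article indicates, that pupils are ranked on the criterion score and divided into ten equal groups (46 each for 460 pupils), and that the per cent of correct spellings is then found within each group. The function name and arguments are hypothetical.

```python
def line_of_discrimination(total_scores, item_correct, groups=10):
    """Per cent correct on one item in each ability group, lowest group first."""
    # Rank pupils by criterion (total) score, lowest first.
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    size = len(order) // groups  # 46 when 460 pupils form ten groups
    if size == 0:
        raise ValueError("fewer pupils than groups")
    line = []
    for g in range(groups):
        # Pupils left over when the group size does not divide evenly are ignored.
        members = order[g * size:(g + 1) * size]
        n_correct = sum(1 for i in members if item_correct[i])
        line.append(100.0 * n_correct / len(members))
    return line


# A perfectly discriminating item of 50 per cent difficulty, four groups of five:
scores = list(range(20))
print(line_of_discrimination(scores, [s >= 10 for s in scores], groups=4))
# [0.0, 0.0, 100.0, 100.0]
```

Plotting the returned values against the group midpoints gives a sharply rising line for a discriminating word and a nearly horizontal one for a word of little discriminating power.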

In terms of the facts presented in these tables and diagrams, we are now ready to proceed to an evaluation of the various indices. The conclusions which may be drawn concerning each index from the facts thus presented are summarized in the following paragraphs.

Index A: This index may be immediately eliminated from serious consideration on the basis of the obtained correlation between
TABLE III

THE OBTAINED PER CENT CORRECT SPELLINGS AND THE FIVE INDICES FOR EVERY TENTH WORD IN THE FOUR LISTS

                         Per Cent          Indices of Discrimination
                         Correct
 No.  Word               Spellings     A      B      C      D    E (r bis)

List I
   1  adapted              72.0       1.86   .587   .492   .299   .513
  11  conserve             76.8       1.74   .529   .477   .320   .505
  21  exert                71.2       2.55   .775   .5892  .434   .625
  31  bus                  73.3       1.4    .299   .388   .127   .291
  41  diplomatic           74.0       2.8    .853   .619   .489   .705

List II
   1  academy              47.5       6.5    .976   .756   .501   .706
  11  clammer              43.7       4.9    .669   .709   .347   .479
  21  dissatisfied         38.7       4.6    .705   .733   .311   .508
  31  gorgeous             50.1       6.5    .976   .755   .518   .726
  41  museum               51.8       5.9    .965   .748   .525   .717

List III
   1  requisition          13.1      99.0    .463   .895   .355   .690
  11  academic             24.9       7.5    .607   .821   .333   .609
  21  abhor                24.7       3.9    .469   .793   .267   .491
  31  cornice              11.8      20.2    .461   .923   .438   .717
  41  censure              06.9      16.9    .180   .914   .164   .484

List IV
   1  laborer              78.5       1.9    .588   .526   .372   .509
  11  faucet               42.4       8.1    .912   .778   .493   .694
  21  genteel              16.8       7.0    .427   .878   .297   .526
  31  version              73.1       2.1    .681   .508   .341   .604
  41  vicious              52.3       3.4    .769   .667   .378   .587


TABLE IV

AVERAGE PER CENT CORRECT SPELLINGS AND AVERAGE VALUE OF EACH INDEX FOR WORDS IN EACH LIST AND IN COMBINED LIST

              Per Cent          Indices of Discrimination
              Correct
              Spellings     A      B      C      D    E (r bis)

List I          64.9       2.9    .695   .582   .364   .566
List II         43.3       9.8    .821   .7387  .409   .631
List III        22.1      30.1    .548   .853   .347   .627
List IV         42.2      17.4    .629   .718   .351   .592
Total List      43.1      15.1    .673   .723   .368   .604


TABLE V

STANDARD DEVIATIONS OF PER CENT CORRECT SPELLINGS AND OF VARIOUS INDICES FOR WORDS IN EACH LIST AND IN COMBINED LIST

              Per Cent          Indices of Discrimination
              Correct
              Spellings     A      B      C      D    E (r bis)

List I          12.5       1.7    .167   .114   .007   .11
List II         10.7      14.2    .170   .981   .102   .118
List III        12.1      34.9    .224   .066   .109   .1235
List IV         21.6      28.1    .191   .155   .098   .151
Total List      21.2      25.6    .218   .145   .105   .132


TABLE VI

THE INTERCORRELATION BETWEEN THE INDICES AND THE PER CENT CORRECT SPELLINGS FOR THE WORDS IN EACH LIST AND FOR THE WORDS IN THE COMBINED LIST

                  A       B       C       D       E

List I
  Acc.         -.637   -.373   -.831   -.208   -.148
  A                     .701    .759    .636    .607
  B                             .762    .893    .921
  C                                     .706    .635
  D                                             .954

List II
  Acc.         -.529    .232   -.783   -.004   -.211
  A                    -.059    .488    .142    .265
  B                             .357    .887    .837
  C                                     .609    .708
  D                                             .919

List III
  Acc.         -.221    .526   -.626    .188    .000
  A                    -.167    .565    .100    .484
  B                            -.427    .840    .545
  C                                     .088    .354
  D                                             .787

List IV
  Acc.         -.544    .205   -.949   -.162   -.332
  A                    -.204    .596    .301    .619
  B                            -.055    .763    .425
  C                                     .356    .494
  D                                             .834

Combined Lists
  Acc.         -.507    .270   -.916    .001   -.254
  A                     .239    .549    .094    .451
  B                             .930    .829    .566
  C                                     .287    .497
  D                                             .835








Note: It should be kept in mind in reading this
table that a negative correlation with per cent
correct spellings is the same as a positive correlation
with difficulty.





[Figures 12-15: Illustrating the Lines of Discrimination for Different Indices. Each panel plots per cent of correct spellings against percentile standing on criterion measure.]


its value and the difficulty of the words to which it was applied. We note in Table VI that this index shows a negative correlation of .507 with per cent correct spellings. It follows, therefore, that to use this index as a basis for selecting superior items would to a significant degree be equivalent to selecting words on the basis of difficulty alone. This fact becomes more readily apparent when we note in Table IV that the index has an average value of 2.9 for the easy words, as compared with an average value of 30.1 for difficult words. It should be noted that this index, which compares very unfavorably with all but one of the indices investigated, is essentially the same as that previously referred to as recommended by Symonds, and is perhaps among the most widely used of such indices at the present time.




Index B: While this index shows a correlation with difficulty for the words in the combined list, it should be noted that the correlation is not sufficiently high to be of great significance. It should be noted also that this index shows a negative correlation with per cent correct spellings for the words in the easy list, and a relatively high positive correlation with per cent correct spellings for the words in the difficult list. Thus it tends to favor those words which are of medium difficulty, as is evidenced in Table IV by the fact that its average value for list II is .821 as compared to .695 and .548 for lists I and III respectively. Consequently, it tends to penalize those words which are either very easy or very difficult, particularly the latter. This characteristic, however, might be interpreted as a desirable one, since in general very easy or very difficult test items affect too small a proportion of the entire group to contribute significantly to the effectiveness of the entire test.

Index B has the desirable characteristic that it may be rapidly and conveniently computed, but has also the compensating weakness that it is based only upon the data secured from one-half of the experimental group, and hence is more seriously susceptible to sampling fluctuations than are certain of the indices which are later to be described. It tends to give a relatively high rank to words such as those illustrated in Figure 13, which discriminate effectively between pupils in the high and low ability groups but not between those in the intermediate range of ability.

Index C: This index is without doubt the least satisfactory of those here considered. According to Table VI, it correlates +.92 with the obtained difficulty of the words, and hence ranks items in almost exactly the same order as they would be ranked on the basis of difficulty alone. For example, it rates the word "occurrences," with only 1.1 per cent correct spelling, as the most discriminating word in the combined lists. For words of approximately the same difficulty, it may be a fairly satisfactory basis for comparison.

Index D: This index is probably as satisfactory as any of those here investigated. We note in Table VI that it shows no correlation with difficulty for the words in the combined list. We note, however, in the same table, that it shows a slight negative correlation with per cent correct spellings for words in the easy list, and a slight positive correlation with per cent correct spellings for words in the difficult list. This means, as in the case of Index B, that it tends slightly to favor words of intermediate difficulty. This conclusion is substantiated by the evidence in Table IV, where we note that this index has an average value of .409 for the list of medium difficulty, as compared to .364 and .347 for the easy and difficult lists respectively. Again as in the case of Index B, this characteristic may be considered as advantageous to the index. Index D has the apparent disadvantage of complexity and consequently of inconvenience in calculation. This is of significance, however, only in situations where adequate tabulating equipment is not available. Where Hollerith tabulating equipment is used, it may be computed almost as readily as any of the other indices considered.

The strongest evidence in support of Index D which was obtained in this investigation cannot conveniently be presented here because of the fact that it is subjective or inspectional in nature and cannot be reduced to quantitative terms. The best single measure of the discriminating power of an item used in this investigation was the line of discrimination for the item constructed by the method already described. An examination of these lines of discrimination for a large number of words and a comparison of the corresponding values of Index D for the same words showed that Index D invariably showed a high value for words that revealed a sharply rising line of discrimination and a low value for words whose lines of discrimination tended to approach the horizontal. This tendency was slightly stronger for Index D than for Index B, and definitely stronger than for indices A and C. In other words, the inspectional correlation between the value of Index D and the discriminating power of items as indicated by the lines of discrimination appears to be as high as, or higher than, that for any of the indices A, B, and C. A few selected instances of this type of evidence are presented in Figures 14 and 15. We note in Figure 14, for example, that Index D is relatively high for the words "emphatically" and "pageant," which are sharply discriminating, while it is relatively low for the words "forsaken" and "judgment," which are weak in discriminating power. Similar comparisons may be made for the words illustrated in Figure 15. Finally, Index D has a value of unity only for perfectly discriminating items, a value of zero for items of no discriminating value, and negative values for negatively discriminating items.

Index E (bi-serial r): This index, we note in Table VI, shows a slight negative correlation with accuracy of the words in the combined list, most of which is accounted for by the words in lists I and II alone--that is, words of medium or slight difficulty. To a very slight degree, therefore, it tends to penalize words of less than medium difficulty, as indicated by the facts in the last column of Table IV.

This index, however, has the decided advantage that it is based upon all of the data obtained from the experimental group and for that reason is probably more reliable--i.e., less subject to sampling fluctuations--than is any of the other indices. On the other hand, it has the compensating disadvantage of complexity and inconvenience in computation. This disadvantage is such as to render it almost wholly impracticable for application on a large scale except where Hollerith tabulating equipment is available. For such equipment, however, the authors have devised a procedure whereby it has been computed by a single operator at a rate as high as 50 to 75 items per hour, not including the time required to punch the Hollerith cards.

The most convincing evidence concerning the desirability of this index is the inspectional correlation which it shows with the discriminating power of items as indicated by the lines of discrimination constructed for them. Instances of this type of evidence are presented in Figures 14 and 15 and may be interpreted in the manner already suggested. On the basis of evidence of this type, it appears to be very closely on a par with Index D. We note in Table VI that it shows a positive correlation of .835 with this latter index for the words in the combined list, and an even higher correlation for the words in each of the sub-lists except that containing the more difficult words.

There is some evidence that the bi-serial r index tends to favor extremely difficult items. We note, for example, in Figure 13 that, while the words "diplomatic," "carnival," and "clutches" show curves that are as sharply rising as those for "philosophical," "negligee," and "legitimate," the latter words (especially the last one) show significantly higher values for Index E. As was true for Index D, Index E has a maximum value of 1.0, a minimum value of -1.0 (for negatively discriminating items), and is equal to 0.0 for items of no discriminating power.

The preceding arguments suggest the fol- 
lowing conclusions: 

(1) Indices A and C are both related to difficulty in such a significant degree that they may be eliminated from serious consideration on that basis alone.

(2) Index B compares somewhat unfavorably with indices D and E, particularly because it unduly penalizes items of high and low difficulty. It is also probably less reliable than the latter indices. It is, however, quite easy to compute, and for that reason may be practicable for use in a situation where only approximate measures are desired and where tabulating equipment is not available.

(3) Both indices D and E are definitely superior to the others. The fact that they showed a correlation of .85 for the 200 words studied, in spite of the lack of complete reliability in either of them, is strong evidence that for practical purposes it matters little which is employed. Both are very difficult to compute except with the aid of tabulating machines. Index E, because it is based on the data from all pupils tested, is probably more reliable than Index D.

The difficulty of arriving at any reliable subjective estimate of the validity of a single test item makes it imperative that some objective, quantitative measure of goodness be employed in precise test construction. Very frequently test items may be found between which it is impossible to choose on a subjective basis, but which may be shown by objective analysis to differ very markedly in their effectiveness to discriminate between good and poor pupils. The words "advisers" and "adequately" may appear to be equally good test words, yet the first has been shown to be more frequently missed by good than by poor spellers in a random eighth-grade group, while the second showed a sharply rising line of discrimination for the same group. The value of Index E for the first was -0.12 and for the second was .831. It is thus possible, by selecting only words which show a high value for this index, to construct a much more valid and reliable spelling test of a given length than could possibly be constructed on a subjective or random basis of selection.

Similar applications of an index of dis- 
crimination might be found in almost any 
field of test construction. There is no 
field in which subjective evaluation is an 
infallible basis for selecting test items. 
Anyone experienced in test construction knows 
how seriously the value of a test item may 
be affected by even the slightest variations 
in structure or phrasing--variations that al- 
most invariably escape notice until they have 
been disclosed by objective evidence of the 
type presented in this article. In any field,
therefore, where a dependable criterion measure
is available, individual test items
should be evaluated against that criterion
by means of an objective index of the type
described.

Through the systematic application of an objective index, and through a careful study of items thus proven to be effective or ineffective, it may be possible to arrive at new generalizations concerning the discernible characteristics of a good test item, and thus provide a better basis for subjective evaluation. Studies may be made also, for any individual concept or skill, of the various forms in which it may be presented in a test, such as the true-false statement, the incomplete statement, etc., for the purpose of determining the form which best fits that concept or skill.

The value of an index of discrimination in any specific application obviously depends primarily upon the validity of the criterion measure against which the individual test items are evaluated. In many fields of achievement testing it is very difficult to secure an independent criterion measure that is demonstrably superior in validity to the total score on the test to be evaluated. While, in such situations, the total score on the test itself may be used as the criterion, it is very important to note that the index of discrimination then becomes a measure of reliability rather than of validity. An index so obtained only enables one to select those items within the test that are individually most effective in measuring whatever the test as a whole happens to be measuring. If the test as a whole is not valid--if it happens to be measuring the wrong thing--then the selection or elimination of items on this basis will only make the revised test a more reliable measure of that wrong thing, i.e., the test may become more reliable as it becomes less valid.

The use of the total score on a test as a 
basis for thus evaluating its constituent i- 
tems is therefore defensible to the degree 
that the original total score provides a 
valid measure of the ability to be tested. 
If the original score is reasonably valid, 
and if only a relatively small number of 
items are eliminated, revised, or replaced 
because of low indices, then the revised 
test will usually be improved in both valid- 
ity and reliability. If, however, only a 
small proportion of the original items is 
retained (because of high indices), and if 
this limited selection is then readminister- 
ed, new total scores secured, and new in- 
dices computed, and if then the whole proc- 
ess is repeated again and again, the final 
result will inevitably be a collection of 
items which are highly interrelated because 
they are all measuring the same narrow  as- 
pect of the general function which the orig- 
inal test was intended to measure. In other 
words, repeated revision of a test on this 
basis will secure higher and higher reliability
by narrowing more and more the mental
function tested.

For these reasons the "index of discrimination"
should be used with particular caution
for evaluating the items within a given
achievement test when the criterion employed 
is the total score on the test itself. If 
all of the original items, however, satisfy
the more subjective requirements for validity--if
their content belongs in the subject
matter being tested and if they have been
well selected and distributed with reference
to the objectives of instruction--then the
index of discrimination will prove extremely 
valuable for identifying or drawing atten- 
tion to those items which are functionally 
weak because of structural imperfections, 
ambiguities, the inclusion of irrelevant 
clues or determiners, or other technical de- 
fects which have previously escaped the at- 
tention of the test builder. Any revisions 
made, however, should be such that the orig- 
inal balance of content or emphasis with re- 
spect to the major objectives of instruction 
is retained, i.e., the revisions should not 
be allowed to alter the general character of 
the function tested, or the range of difficulty
of its constituent items.

Finally, it is important to note, partic- 
ularly in the situation in which the criteri- 
on is the total score on the collection of 
test items being evaluated, that individual 
test items may show low indices because of 
wrong learning, rather than because of tech- 
nical weaknesses in test construction or be- 
cause of low validity in general. To take 
an extreme example, if pupils have been 
taught to spell a word incorrectly, then 
that word will show a negative index in an 
otherwise valid test for pupils so taught. 
The same word in the same test may show a 
high positive index for another group of pu- 
pils properly taught. This is what does hap- 
pen to a limited degree in subjects such as 
history or other social studies where controversial
issues are considered. For example,
pupils in our public schools are often taught 
or otherwise encouraged to attribute altru- 
istic or idealistic motives to certain na- 
tional actions or policies in cases where 
the real underlying motives were undoubtedly 
of a different character. 

It may frequently happen, therefore, that 
a test item may show a low or negative index 
of discrimination, even though that item is 
structurally perfect, is keyed in accordance 
with scientifically established facts or the 
best of authoritative opinion, and holds the 
pupil responsible for an unquestionably de- 
sirable trait, attitude, skill, ability, or 
item of information. If the learning of the 
wrong response has been encouraged, then the 
better pupils in general will make that wrong
response in the test more often than the inferior
pupils, and the item will correlate
negatively with the total score, thus reducing
both its reliability and its validity.

Lack of direct instruction, inadequate
instruction, insufficient ability, or nega- 
tive transfer from other learning situations, 
as well as wrong learning, may lead to the 
same result, particularly in recognition 
types of tests. If pupils have no real ba- 
sis for selecting the correct response, they 
may resort to "guessing," and the resulting 
chance distribution of responses will show a 
zero or low correlation with the ability 
measured by the total score on the test.

Individual test items, then, may prove
invalid for inclusion in an achievement test, 
even though the content of those items is 
unquestionably valid for inclusion in the 
course of study. This double aspect of the 
validity of achievement test materials is 
one which is seldom given adequate consider- 
ation. If a test is designed to express in 
terms of a single score a pupil's relative 
ability in a field of subject matter, then,
in the evaluation and selection of the test
items, the question of what is being learned 
is of almost equal significance with that of 
what should be learned or taught. 

The present tendency is to evaluate a- 
chievement tests on the latter basis only, 
and of course many tests appear wanting when 
so evaluated. There are many things that 
should be taught that are now given no con- 
sideration in the classroom, and other things 
that should be learned in one way that are 
now being learned in another. As long as we
continue to use general achievement tests of 
the single score type, we must recognize the 
definite limitations placed upon the test 
builder by the present status of the curricu- 
lum and of methods and effectiveness of in- 
struction, and must avoid the mistake of 
blaming him if his test does not hold the 
pupil responsible for everything which ap- 
pears to all of us highly desirable that good 
teaching should engender. 














THE INTERPRETATION OF THE COEFFICIENT OF CORRELATION
Walter S. Monroe
University of Illinois
and
Dewey B. Stuit
Graduate Student, University of Illinois


I. INTRODUCTION


The product-moment coefficient of correlation
developed by Karl Pearson was originally
conceived of as the slope of the line
of regression when both variables are expressed
as deviations from their respective
means and in terms of their respective standard
deviations as units. If $\bar{X}_1$ designates
the mean of the values of $X_1$ corresponding to
a given value of $X_2$, the second variable, the
equation of the line of regression may be
written as follows:

$$\frac{\bar{X}_1}{\sigma_1} = r_{12}\,\frac{X_2}{\sigma_2}$$

The coefficient of correlation ($r_{12}$) is now
widely used as a measure of the degree of
relationship existing between two sets of
paired measures or between the two variables
represented by them. Coefficients of correlation
appear in large numbers in our more
technical educational journals and are to be
found in many educational texts. When one
consults texts on educational statistics concerning
the meaning to be associated with a
given coefficient he is told that the values
of a coefficient of correlation cannot be
greater than +1.00 or less than -1.00; that
a positive coefficient is evidence that the
larger magnitudes in one set of data tend to
be paired with the larger in the other, and
likewise the smaller magnitudes in one set
of data tend to be paired with the smaller
in the other; that a negative coefficient is
evidence of inverse pairing, the larger magnitudes
in one set tending to be paired with
the smaller ones in the other; that the magnitude
of the coefficient is indicative of
the completeness of this pairing, being complete
when r = ±1.00; and that when the coefficient
is 0.00 the pairing is on the basis
of chance and no relation exists between the
two sets of measures. Values of r between
0.00 and 1.00 indicate the existence of a
relationship between the two sets of paired
measures or the variables represented by
them, and obviously there is some type of
correspondence between the magnitude of the
coefficient and the degree of relationship.
But educational statisticians have given
scant attention to the degree of relationship
to be associated with particular numerical
values of r, such as .18, .30, .50 or
.75.

In 1917 Rugg¹ suggested the following general
interpretations: r less than .15 to
.20, correlation "negligible" or "indifferent";
r from .15 or .20 to .35 or .40, correlation
"present but low"; r from .35 or .40
to .50 or .60, correlation "marked"; r above
.60 or .70, correlation "high". A general
classification of this type is misleading
because the meaning of the terms used
varies with the type of data being considered.
Coefficients calculated from the
scores obtained from the administration of
two forms of a test are, other things being
equal, higher than those calculated from intelligence
test scores and measures of silent
reading ability; these are higher than
those summarizing the relationship between
high school marks and those received in college;
and these in turn are higher than the
coefficient for IQ and measures of teaching
success. Hence a coefficient of .50 would
be very high for IQ and measures of teaching
success, slightly below average for high school
and college marks, low for the relation
between intelligence test scores and measures
of silent reading ability, and very low
for the reliability of a test.

1. H. O. Rugg, Statistical Methods (Boston: Houghton Mifflin Company, 1917), p. 256.


Odell¹ gives the ranges of reported coefficients
of correlation for several types
of data. For example, for first and second
applications of a standardized group test the
range is given as .60 to .90. Such a table
is only partially satisfactory because the
magnitude of the coefficient obtained from
paired measures of two variables depends upon
the range of the population represented by
the data. If the population is selected from
a single school grade the coefficient of correlation
between mental age and achievement
will be less than if it were calculated from
a population representing a sequence of two
or more grades. In other words, for measures
of two traits the greater the standard
deviations of the distributions of these
measures, the larger the calculated coefficient
of correlation.² Hence a comparison of
coefficients of correlation, even when they
have been obtained from measures of the same
traits, is likely to be misleading unless the
standard deviations of the distributions of
the measures are taken into account.
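
The dependence of r upon the range of the population may be illustrated by a small simulation. What follows is a sketch under assumed conditions, not data from any study: the model (achievement equal to mental age plus an independent chance error), the standard deviations, and the names `pearson_r` and `simulate` are all hypothetical.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simulate(n=4000, seed=0):
    """Return (r for the wide-range population, r for a narrow band of it)."""
    rng = random.Random(seed)
    # Hypothetical model: mental age in months over a wide range of talent.
    mental = [rng.gauss(120, 24) for _ in range(n)]
    # Achievement age = mental age + independent chance error.
    achieve = [m + rng.gauss(0, 18) for m in mental]
    r_wide = pearson_r(mental, achieve)
    # A narrow band of mental age, standing in for a single school grade.
    narrow = [(m, a) for m, a in zip(mental, achieve) if 114 <= m <= 126]
    r_narrow = pearson_r([m for m, _ in narrow], [a for _, a in narrow])
    return r_wide, r_narrow


r_wide, r_narrow = simulate()
```

The relation between the two traits is the same in both groups; only the spread of mental age differs, and the coefficient for the narrow band is much the smaller of the two.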

In approaching a consideration of the interpretation
of a coefficient of correlation,
two points should be emphasized: (1) the
calculation and the interpretation of the
product-moment coefficient of correlation imply
certain assumptions in regard to the data
from which it is calculated; (2) the interpretation
of a coefficient must be made
with respect to a definite population such as
an age group or a grade group, and when the
population for which the interpretation is
desired is not the same as that from which
the coefficient has been calculated, the value
of r must first be adjusted to the desired
population.

The assumptions underlying the calcula- 
tion of the product-moment coefficient of 
correlation can best be understood if one 
thinks of the usual correlation table. The 
first and most important condition is that 
the distribution and frequencies of the entries
in this table exhibit a rectilinear relationship
rather than one that is curvilinear.
If the relationship is rectilinear the
line formed by connecting the means of the 
columns, or of the rows, will approach a 
straight line rather than a curve. When the 
line connecting the means suggests a section 
of a parabola or some other curve, the prod- 
uct-moment coefficient is not an appropriate 
technique. When there is any uncertainty 
concerning the nature of the relationship the 
Blakeman test³ should be applied and if a
satisfactory degree of linearity is not shown
the product-moment coefficient should not be
used.
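
The Blakeman criterion described in the footnote, N(η² - r²) compared with the limit 16.38, may be computed as in the following sketch. It assumes that the first variable has already been grouped into the columns of a correlation table and that η is the correlation ratio of the second variable on the first; the function name `blakeman_statistic` and both sets of data are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def blakeman_statistic(pairs):
    """pairs: (x, y) with x already grouped into class values (columns).

    Returns (r, eta, N * (eta**2 - r**2)), where eta is the correlation
    ratio of y on x computed from the column means.
    """
    n = len(pairs)
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    r = sxy / sqrt(sxx * syy)

    columns = defaultdict(list)
    for x, y in pairs:
        columns[x].append(y)
    # Sum of squares of the column means about the general mean.
    between = sum(len(col) * (sum(col) / len(col) - my) ** 2
                  for col in columns.values())
    eta_sq = between / syy
    return r, sqrt(eta_sq), n * (eta_sq - r * r)


# Hypothetical tables of N = 50: column means on a straight line,
# and column means on a parabola.
linear = [(x, 2 * x + d) for x in range(1, 6) for d in (-1, 1) for _ in range(5)]
curved = [(x, (x - 3) ** 2 + d) for x in range(1, 6) for d in (-1, 1) for _ in range(5)]

r_lin, eta_lin, stat_lin = blakeman_statistic(linear)
r_cur, eta_cur, stat_cur = blakeman_statistic(curved)
LIMIT = 16.38  # limit quoted in the footnote
```

For the first table the expression is zero apart from rounding; for the second it far exceeds the limit, and the product-moment coefficient, which is zero there, would conceal a marked curvilinear relationship.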

The second requirement is homoscedasticity 
which means that the variabilities (standard 
deviations) of the several arrays (columns 
and rows) of the correlation table must be 
equal. Small departures from equality, how- 
ever, do not seriously affect the coefficient 
of correlation as a measure of relationship 
and for most educational data the degree of 
homoscedasticity will be satisfactory, provided
the correlation is linear.

The requirement that the correlation surface⁴
be normal has frequently been assumed
to be a necessary prerequisite, but according
to Kelley this is not true. When the departures
from normality are such, however,
that a few pairs of measures affect the coefficient
of correlation to such an extent
that their omission would materially change
the calculated value, it is apparent that the
obtained coefficient should not be regarded
as a dependable measure of the relationship.
The situation is analogous to the use of the
mean as a measure of the central tendency of
a distribution in which a few extreme measures
cause the difference between the mean
and the median to be relatively large.

1. C. W. Odell, Educational Statistics (New York: The Century Company, 1925), p. 178.
2. It is, of course, necessary that other conditions remain the same.
3. The Blakeman test for linearity is based on the difference between the correlation ratio (η) and the coefficient of correlation (r). The expression to be evaluated may be written $N(\eta^2 - r^2)$. When the relationship is perfectly linear the value of this expression is zero, and as the relationship becomes curvilinear the value of the expression indicates the degree of departure from linearity. Although there is not complete agreement between authorities, the product-moment coefficient probably can be safely used when $N(\eta^2 - r^2) < 16.38$. A more conservative limit is 11.37 and it probably is wise to use this limit when N is not large. For a longer form of Blakeman's test see K. J. Holzinger, Statistical Methods for Students in Education (Boston: Ginn and Company, 1928), p. 185.
4. The term "correlation surface" implies representation in three dimensions, the frequencies in the correlation table being represented as vertical distances. The correlation surface is normal when all of the arrays, both columns and rows, form normal distributions.

The fact that frequently the interpretation
of the coefficient is desired for a population
other than that from which it has
been calculated is responsible for much confusion
and misinterpretation. Obviously in
such cases it is necessary first to calculate
or estimate the coefficient of correlation
for the desired population. Precise determinations
are seldom possible, but the
available techniques may be described for the
following cases.





Coefficient of Correlation        Interpretation Desired For
Calculated From

1. A random sample                1. The infinitely large
                                     population or the universe
                                     of which the observed
                                     population is a random
                                     sample

2. A heterogeneous                2. The corresponding
   population                        population homogeneous
                                     with respect to one or
                                     more specified traits or
                                     phenomena

3. A sample population            3. Corresponding unselected
   selected in a systematic          population, the correlation
   manner such that the              surface being normal¹
   correlation surface is
   normal





It should be noted that these three cases
do not include all situations in which the
value of r for a specified population is desired.
Frequently, probably usually, the
sample from which a coefficient has been calculated
is neither random nor selected in a
systematic manner such that the correlation
surface is normal. Hence the techniques to
be described for the first and third cases
will yield only an approximation of the value
of r for the desired population.

The magnitude of a coefficient of correlation
is affected by the presence of variable
errors² in the data from which it is
calculated. Variable errors are usually
present in educational data. However, if
the interpretation of a coefficient is to
be made in terms of the measures used rather
than in terms of the traits they represent,
the fallibility of the measures is not a matter
of concern. Teachers' marks are known
to be fallible measures of achievement but
if the coefficient of correlation calculated
from such data is to be interpreted in terms
of the relationship between the two sets of
marks the only errors to be considered are
the accidental ones that may have occurred
in copying the marks from the school records
or in handling them. If, however, the interpretation
is to be made in terms of the
relationship between the achievements represented
by the marks, their reliability must
be considered.

When one has specified the traits and 
population for which interpretation is de- 
sired and has calculated or estimated the 
value of r for them, there remains the prob- 
lem of interpretation. The correlation 
technique is employed in dealing with several 
types of problems and the meaning of a co- 
efficient depends upon the question whose 
answer is being sought. Three types may be 
recognized: (1) When a regression equation, 
derived from a correlation table, is used as
a formula of prediction, how accurate are 
the predictions? (2) When r is a coefficient 
of reliability, what is the magnitude of the 
variable errors of measurement? (3) Given 
two sets of paired measures what degree of 
relationship between them does a given value
of r designate? 

Before considering the interpretation of
a coefficient of correlation with respect
to these questions, the methods of calculating
or estimating r for the desired measures
and population will be considered. In
connection with the consideration of the method
of estimating the probable coefficient of
correlation for the universe from the value
of r obtained from a random sample, incidental
attention is given to the question, "In
the universe is there any relation between
the two traits or phenomena?"

1. A special case of the third is created when the selection is on the basis of "range of talent," usually measured in terms of grade range. For example, the coefficient of correlation is calculated from a fifth-grade population and the interpretation is desired for the range from the fourth to eighth grade. This case may also be stated in reverse order.
2. A variable error is a chance error. For an unselected group of measures, the mean of the variable errors approaches zero. Hence this class of errors is to be distinguished from a systematic error.



II. CALCULATING OR ESTIMATING r FOR
DESIRED MEASURES AND A DESIRED POPULATION¹


The value of r for true measures. The effect
of variable errors in the measures is to
decrease the numerical value of a coefficient
of correlation. This effect is called attenuation.
Several formulae have been proposed
for correcting a calculated coefficient for
attenuation. The following are probably most
frequently used.





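$$r_{\infty\omega} = \frac{r_{12}}{\sqrt{r_{1\mathrm{I}}}\,\sqrt{r_{2\mathrm{II}}}} \qquad \text{(I)}$$

$$r_{\infty\omega} = \frac{\sqrt{r_{1\mathrm{II}}\; r_{\mathrm{I}2}}}{\sqrt{r_{1\mathrm{I}}\; r_{2\mathrm{II}}}} \qquad \text{(II)}$$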


In both of these formulae $r_{\infty\omega}$ designates
the corrected coefficient, the subscripts 1
and I two independent measures of the first
trait or phenomenon, and the subscripts 2
and II two independent measures of the second
trait or phenomenon. The name "reliability
coefficient" is applied to $r_{1\mathrm{I}}$ and $r_{2\mathrm{II}}$.
When using these formulae it is necessary to
keep in mind that the reliability coefficients
must be for the same population as the
other coefficients. It should be noted also
that the correction obtained by applying one
of these formulae is only for variable errors
of measurement. No allowance has been
made for variable errors of validity or for
variable errors due to inequalities in the units
of the scale of measurement.
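
The arithmetic of the correction may be sketched as follows. Only the first and more familiar form is shown, the obtained coefficient divided by the square root of the product of the two reliability coefficients; the function name and the numerical values are hypothetical and serve only to show the size of the correction.

```python
from math import sqrt


def correct_for_attenuation(r12, rel_1, rel_2):
    """Correction for attenuation in its usual first form.

    r12   : obtained correlation between the two fallible measures
    rel_1 : reliability coefficient of the first measure
    rel_2 : reliability coefficient of the second measure
    All three must come from the same population.
    """
    return r12 / (sqrt(rel_1) * sqrt(rel_2))


# Hypothetical values: obtained r of .50, reliabilities of .80 and .70.
corrected = correct_for_attenuation(0.50, 0.80, 0.70)
```

With these values the corrected coefficient is about .67. The less reliable the measures, the larger the correction, which is one reason the reliability coefficients must be obtained for the same population as the coefficient being corrected.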

The probable value of r for the universe
or an infinitely large population of which
the data collected are a random sample. If a
large number of large² random samples were
drawn from a universe or infinitely large
population, the coefficients of correlation
calculated from them, if not approaching
±1.00, would form a normal distribution and
the mean of the distribution would be the
most probable value of r for the universe.
When the value of r has been calculated for
a known or assumed large random sample the
median deviation (probable error) of this
distribution is usually calculated by the
formula³











$$PE_r = \frac{.6745\,(1 - r^2)}{\sqrt{N}} \qquad \text{(III)}$$


The mean of the distribution (value of r
for the universe) cannot be determined but
the probable limits within which it would
fall may be calculated from the obtained values
of r and $PE_r$. The chances are 1 to 1
that it lies between $r - PE_r$ and $r + PE_r$;
4.6 to 1 that it lies between $r - 2PE_r$ and
$r + 2PE_r$; and so on. For example, if $r_{12}$ =
.50 and $PE_r$ = .03, the chances are 1:1 that
the difference between .50 and the mean is
not more than ±.03. In other words the
chances are 1:1 that the position of the distribution
is such that the mean is between
.47 and .53. The chances are 4.6 to 1 that
it is between .44 and .56 and 22 to 1 that
it is between .41 and .59.
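
The odds quoted above follow from the normal curve, as the short sketch below shows. It assumes the usual large-sample formula for the probable error of r and a normal distribution of sample coefficients; the function names are hypothetical, and the sample size of 284 is chosen only because it gives a probable error of about .03 when r is .50.

```python
from math import erf, sqrt


def probable_error_r(r, n):
    """Probable error of r by the usual large-sample formula."""
    return 0.6745 * (1 - r * r) / sqrt(n)


def odds_within(k):
    """Odds that a normal deviate falls within k probable errors of the mean."""
    # One probable error is 0.6745 standard deviations.
    p = erf(k * 0.6745 / sqrt(2))
    return p / (1 - p)


pe = probable_error_r(0.50, 284)   # hypothetical N giving PE of about .03
odds = [odds_within(k) for k in (1, 2, 3)]
limits = [(0.50 - k * 0.03, 0.50 + k * 0.03) for k in (1, 2, 3)]
```

The three odds come out close to 1 to 1, 4.6 to 1, and 22 to 1, in agreement with the figures in the text.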

It should be noted that this procedure is
applicable only when the data from which r
has been calculated form a sample of the
universe that is both random and large. In
educational research it is seldom that the
universe is sampled by a random process and
hence the randomness of the sample is usually
uncertain. Failure of the correlation
surface to approach the conditions of normality
is evidence that the sample is probably
not completely random. Hence when the correlation
surface exhibits marked irregularities,
especially when the elimination of a
few extreme pairs of the measures would materially
affect the obtained value of r, the
values obtained for the probable limits of r
for the universe should be regarded as only
approximations.

1. It is, of course, assumed that the conditions on which the calculation of the coefficient of correlation is based are satisfied.
2. The term, "large," used as descriptive of the size of a sample has no precise meaning, but the reader may interpret it as signifying a number larger than 50. One hundred is a more conservative minimum.
3. The more correct formula is: $PE_r = \frac{.6745\,(1 - r^2)}{\sqrt{N - 2}}$. Mordecai Ezekiel, Methods of Correlation Analysis (New York: John Wiley and Sons, 1930), p. 256. When a coefficient of correlation has been corrected for attenuation the correct formula varies with the one used to secure correction for attenuation. See T. L. Kelley, Statistical Method (New York: The Macmillan Company, 1928), p. 210 f.

As N becomes small the mean of the distribution
of the values of r obtained from
a large number of similar random samples
tends to be greater than the coefficient of
correlation for the universe. Hence in such
a case the method described above for calculating
the probable limits of the value of
r for the universe will not give accurate determinations.
The error will not be large,
but the careful worker should not tolerate
even small errors. When N is less than 30
Fisher's method of determining the significance
of correlation coefficients obtained
from small samples or Ezekiel's simpler modification
of Fisher's method should be used.¹
Ezekiel gives the following formula for calculating
a corrected coefficient ($\bar{r}$) when N
is small.





$$\bar{r} = \sqrt{1 - (1 - r_{12}^2)\,\frac{N - 1}{N - 2}}$$


The corrected coefficient obtained by means
of this formula is a more reliable estimate
of the coefficient for the universe. Ezekiel
gives a chart for determining the "probable
minimum" correlation existing in a universe,
defining this "probable minimum" as that for
which the chances are 19:1. For N = 50, and
$r_{12}$ = .27, the "probable minimum" is zero.
This is somewhat more conservative than the
result obtained by the usual method which
would give the odds as approximately 22:1.
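
The small-sample correction may be computed as in the sketch below, which takes the formula to be the square root of 1 - (1 - r²)(N - 1)/(N - 2). The function name is hypothetical, and two details are assumptions of the sketch rather than statements from the source: the corrected value is returned as zero when the quantity under the radical is negative, and the sign of the obtained r is carried over to the result.

```python
from math import copysign, sqrt


def ezekiel_corrected_r(r, n):
    """Corrected coefficient for a small sample of n pairs (n > 2)."""
    adjusted_sq = 1 - (1 - r * r) * (n - 1) / (n - 2)
    if adjusted_sq <= 0:
        return 0.0  # assumption: no relationship is indicated
    return copysign(sqrt(adjusted_sq), r)  # assumption: keep the sign of r


r_50 = ezekiel_corrected_r(0.27, 50)   # the example of N = 50, r = .27
r_10 = ezekiel_corrected_r(0.50, 10)   # hypothetical smaller sample
```

For N = 50 and r = .27 the corrected value is about .23, a small reduction; for a hypothetical sample of 10 pairs an obtained r of .50 falls to about .40, which shows how rapidly the correction grows as N becomes small.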

A positive coefficient, even one that is
numerically small, denotes a positive or direct
relationship between the two sets of
paired measures. A negative coefficient denotes
an inverse relationship. Hence, if
the value of r for the universe should not
have the same sign as the calculated value,
the meaning would be reversed. A calculated
coefficient of correlation is called statistically
significant when the probability is
very slight that the coefficient for the universe
would be zero or have the opposite
sign. It is a common practice to designate
a coefficient as statistically significant
when r is greater than four or five times $PE_r$.
Examination of Ezekiel's chart shows that
this is not a very sound practice when N is
less than 50.

When the obtained coefficient approaches
1.00, the probable error of sampling may be
misleading. For example, if the obtained coefficient,
as a result of the operation of
chance, is 1.00, the probable error computed
by the formula is zero, and yet the obtained
coefficient may not be the value of r for
the universe. Furthermore the curve of error
for coefficients near 1.00 will be
skewed. Since the true coefficient cannot
be greater than 1.00 the distribution of the
values of r obtained from a large number of
similar random samples cannot extend beyond
this limit. Hence the probable error formula
should be used with caution when the value
of r approaches ±1.00.

The determination of the statistical significance
of a coefficient of correlation
represents one type of interpretation. If it
is found that the "probable minimum" coefficient
for the universe has the same sign and
does not approach zero, the evidence may be
interpreted as meaning that in the universe
there is at least a slight degree of relationship
between the two traits or phenomena.
It should be noted, however, that the procedure
for determining the "probable minimum"
requires that the measures from which r has
been calculated be a random sample of the
universe. Hence determinations of statistical
significance should be made with caution.
Unless the sample is large and the assumption
of its random character appears justified,
the label of "statistically significant"
should not be applied when the coefficient
is only four or five times its probable
error as commonly calculated. Many writers
have judged a coefficient to be statistically
significant when a critical examination of
the evidence would reveal that the sample
probably was not random.

1. R. A. Fisher, Statistical Methods for Research Workers (Edinburgh: Oliver and Boyd, 1925), p. 159-62. Ezekiel, op. cit.
2. Ezekiel, op. cit., p. 121.
3. Ezekiel, op. cit., p. 305.

The value of r for a corresponding population
homogeneous with respect to one or
more specified traits. A population is heterogeneous
with respect to a given trait
when there are individual differences in this
trait. For example, a typical grade group is
heterogeneous with respect to chronological
age. When a population is heterogeneous with
respect to a trait whose correlation with
either or both of the two traits being considered
is not zero, this condition tends to
make the obtained value of r larger than
would be obtained from the corresponding homogeneous
population. For example, if the inquiry
is with respect to the correlation between
scores on an arithmetic test and scores
on a silent reading test, the value of the
obtained coefficient will be determined in
part by the degree of heterogeneity of the
population with respect to general intelligence.
Other things being equal, the greater
the range of intelligence, the larger the
coefficient of correlation. Hence the coefficient
of correlation will be smaller for a
population that is homogeneous with respect
to a variable that contributes to the variability
of the observed population. It may
be zero or even negative.

The effect of heterogeneity of the population
is illustrated in Table I. The data from which the
coefficients of correlation were calculated
were obtained from the pupils of a single
school system. For each grade group the coefficient
for mental age and chronological
age is negative. This, of course, is in agreement
with our general observations. The
brighter pupils in any grade are the younger
ones; the older pupils are generally the
duller ones. When all grades are combined,
however, the coefficient of correlation is
.56, which indicates a distinct positive relationship
between mental age and chronological
age when all of the grade groups are
combined into one population. This again is
in accord with our general observations.
Hence, we have an illustration which emphasizes
that the coefficient of correlation may
be materially affected by the grade-range of
the population from which the data are collected.

TABLE I

EFFECT OF NATURE OF TRAIT AND SELECTION OF
POPULATION GROUP UPON COEFFICIENT OF CORRELATION

Column headings: Grade; No. of Pupils; Mental Age and
Achievement Age; Chronological Age and Achievement Age;
Achievement Quotient and Intelligence Quotient;
Chronological Age and Mental Age. Subheading: Arithmetic.
[Only part of the body of the table is legible; the
coefficients that remain are given in the order in which
they appear, and their column headings cannot be assigned
with certainty.]

Grade        No. of Pupils    Coefficients
3B           314              -.04    -.12
3A           268              -.14    -.20
4B           449              -.18    -.30
4A           241              -.02    -.35
5B           454              -.13    -.31
5A           255              -.01    -.40
6B           426              -.01    -.44
6A           224              -.16    -.40
7B           399              -.19    -.31
7A           201              -.17    -.33
8B           372              -.06    -.40
8A           214              -.15    -.37

All Grades                    .56     -.10     .51
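
The reversal of sign between the separate grades and the combined population may be reproduced with artificial data. The sketch below is an illustration only: the grade norms, the standard deviations, the negative within-grade relation, and the names `pearson_r` and `simulate_grades` are all hypothetical, and the resulting coefficients are not those of Table I.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simulate_grades(pupils_per_grade=400, seed=1):
    """Return ({grade: r within the grade}, r for all grades combined)."""
    rng = random.Random(seed)
    within = {}
    all_ca, all_ma = [], []
    for grade in range(3, 9):
        typical_age = 96 + 12 * (grade - 3)   # months; hypothetical norm
        ca, ma = [], []
        for _ in range(pupils_per_grade):
            overage = rng.gauss(0, 8)          # months older than typical
            ca.append(typical_age + overage)
            # Assumed: within a grade the over-age pupil tends to be duller.
            ma.append(typical_age - 0.6 * overage + rng.gauss(0, 8))
        within[grade] = pearson_r(ca, ma)
        all_ca += ca
        all_ma += ma
    return within, pearson_r(all_ca, all_ma)


within_r, pooled_r = simulate_grades()
```

Within every grade the coefficient is negative, while for the grades combined it is strongly positive, because both ages rise together from grade to grade.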

The coefficients for achievement quotient
and intelligence quotient are negative. In
the case of arithmetic these are relatively
large negative quantities. In fact, they are
slightly larger than the ones for chronological
age and mental age. When all of the
grade groups are combined both of the coefficients
of correlation are negative, and
differ but little from the average of the
coefficients of correlation for the separate
grade groups. Here we have a distinct reversal
of the condition illustrated in the
third column. In the case of both the intelligence
quotient and the achievement quotient
the mean is practically the same for
all grade groups. The standard deviations
are also essentially the same. Not only are
they the same for all grade groups, but when
the groups are combined approximately the
same averages and standard deviations are secured.
We are in this case dealing with
traits which do not increase from grade to
grade, but tend to remain essentially constant.
In the case of chronological age and
mental age the traits increase from grade to
grade. In other words heterogeneity with
respect to grade location of pupil affects
the coefficient of correlation in one case
and does not in the other. This is because
the correlation between grade location and
chronological age and mental age is not zero
while in the case of the quotient measures
the correlation approaches zero.

If the population from which the coeffi- 
cient has been calculated involves the same 
degree of heterogeneity as the population for 
which the interpretation is to be made, no 
attention need be given to this factor when 
r is interpreted as an index of the accuracy 
of prediction. For example, a grade group, 
such as the pupils classified as seventh 
grade, is heterogeneous in respect to chronological
age, mental age, and certain other
characteristics, but if the interpretation 
of r is to be made for seventh-grade pupils 
in general, the degree of heterogeneity may 
be assumed to be approximately the same, pro- 
vided, of course, that the group of pupils 
tested is typical of this grade. If, however, 
the coefficient has been calculated from 
tests administered to pupils in grades four 
to eight or any other sequence of grades and 
it is desired to interpret or use the calcu- 
lated r as an index of the concomitant vari- 
ation (correspondence) within a single grade 
range, it is necessary to allow for the dif- 
ference in heterogeneity. 

It may be emphasized in this connection that only a limited meaning
can be attached to a coefficient of correlation as a measure of the
degree of relationship without considering the absolute
heterogeneity of the population from which it has been calculated.
Very frequently coefficients of correlation are misleading because
of the failure to recognize a factor of heterogeneity.

Partial correlation may, under certain conditions, be used to
eliminate the spurious correlation due to heterogeneity of the
population. For example, the correlation between depth of chest and
mental age for a large group of gifted boys ranging in age from 9 to
14 years has been reported as +.582.¹ The variable chronological age may
be regarded as a common cause of both mental 


192 JOURNAL OF EXPERIMENTAL EDUCATION 


Volume I, No. 3 


age and depth of chest. Its correlation with depth of chest has been
reported for this group as +.618 and its correlation with mental age
as +.941.² Let the symbol $x_1$ represent depth of chest, $x_2$
represent mental age, and $x_3$ represent chronological age. The
partial correlation formula for three variables is


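\[ r_{12.3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}} \tag{IV} \]

Inserting the values given above we have

\[ r_{12.3} = \frac{.582 - .618 \times .941}{\sqrt{1 - .618^2}\,\sqrt{1 - .941^2}} = +.002 \]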








The value +.002 indicates that in the cor- 
responding population homogeneous with re- 
spect to chronological age the relationship 
between depth of chest and mental age is 
practically zero. 

In the same source, the intercorrelations of standing height
($x_1$), mental age ($x_2$), and chronological age ($x_3$) are
given as follows for the same group of gifted boys:

\[ r_{12} = +.835, \quad r_{13} = +.845, \quad r_{23} = +.941 \]

\[ r_{12.3} = \frac{.835 - .845 \times .941}{\sqrt{1 - .845^2}\,\sqrt{1 - .941^2}} = +.220 \]








The value +.220 indicates a significant relationship between
standing height and mental age for the group of gifted boys when the
common cause, chronological age, has been held constant.
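
The two computations above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `partial_r` is hypothetical, and the inputs are the coefficients quoted in the text from Terman's data.

```python
import math

def partial_r(r12, r13, r23):
    """First-order partial correlation r12.3, following formula (IV)."""
    return (r12 - r13 * r23) / (math.sqrt(1 - r13**2) * math.sqrt(1 - r23**2))

# Depth of chest (1), mental age (2), chronological age (3).
chest = partial_r(0.582, 0.618, 0.941)

# Standing height (1), mental age (2), chronological age (3).
height = partial_r(0.835, 0.845, 0.941)

print(f"chest vs. mental age, age held constant:  {chest:+.3f}")
print(f"height vs. mental age, age held constant: {height:+.3f}")
```

To three decimals the results agree with the values +.002 and +.220 reported in the text.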

The partial correlation technique will fail to give the value of r
for a completely homogeneous population if one or more variables
contributing to the heterogeneity of the population are not
considered. For example, if the variables whose relationship is
being studied are represented by $x_1$ and $x_2$, and the other
variables by $x_3$, $x_4$ and $x_5$, elimination of $x_3$ only,





because of ignorance of the fact that the 





1. L. M. Terman, et al. Genetic Studies of Genius, I (Stanford, California: Stanford University Press, 1925), p. 168.

2. Ibid., p. 156 and p. 168.





March, 1933 W. S. Monroe and D. B. Stuit 193 


zero-order correlation between $x_1$ and $x_2$ is partially due to
variables $x_4$ and $x_5$ will not yield the net correlation between
variables $x_1$ and $x_2$. The partial correlation coefficient
$r_{12.3}$ is likely to be larger than $r_{12.34}$ or $r_{12.345}$,
and $r_{12.34}$ is likely to be larger than $r_{12.345}$.¹ An
investigator is likely to neglect to partial out a sufficient number
of variables when he is ignorant of the fact that the variables he
is studying owe some of their gross relationship, i.e., the
relationship indicated by the zero-order correlation between
distributions of measures of them, to the influence of other
variables with respect to which the group is heterogeneous.

The partial correlation technique eliminates too much when the
variables whose relationship is being studied affect the variable
partialled out either directly or indirectly through a more remote
variable. It is difficult to name a variable other than
chronological age which may be partialled out with reasonable
assurance that too much is not being eliminated. Even the variable
chronological age must be partialled out with care if the age range
is greater than two or three years.³

In a study made by the junior author test scores for silent reading
and arithmetic were secured for a population consisting of
approximately equal groups from grades four to eight inclusive. When
grade location was partialled out the correlation between arithmetic
scores and silent reading scores was found to be .36. The mean of
the coefficients for the separate grade groups was .40. Since
achievement in silent reading and arithmetic contribute to the
determination of grade location, these results probably should be
interpreted as illustrating the effect of partialling out too much.

In the case of the fourth grade group the coefficient of correlation
for reading and arithmetic scores was found to be .55. Employing the
partial correlation technique to hold constant the factor of age,
the size of the coefficient increased to .617. Coefficients of
correlation for reading and arithmetic scores were also computed for
pupils of the fourth grade who were of the same age--that is,
coefficients between reading and arithmetic scores were calculated
for pupils of age eight, nine, ten, eleven, and twelve separately.
Weighting each coefficient by the number of pupils for which it was
computed, the results for the different ages were averaged and a
coefficient of .618 obtained. This is almost identical with that
obtained by the partial correlation technique.

In his discussion of correcting coefficients of correlation for
heterogeneity of the data, May⁴ suggests the use of the following
formula:

\[ r_{\epsilon_1 \epsilon_2} = \frac{r_{12}\,\sigma_1 \sigma_2 - r_{m_1 m_2}\,\sigma_{m_1} \sigma_{m_2}}{\sqrt{\sigma_1^2 - \sigma_{m_1}^2}\,\sqrt{\sigma_2^2 - \sigma_{m_2}^2}} \tag{V} \]

The symbols in the above formula have the following meanings:

$r_{12}$ is the correlation between the two variables when all the
homogeneous sub-groups are combined into a single scatter-diagram.


1. Usually, but not necessarily, since the elimination of variables yielding negative zero-order correlations with one of
the variables studied may result in the obtaining of a partial correlation coefficient of greater magnitude than the
zero-order coefficient, or partial correlation coefficient of a lower order. If the variable eliminated correlates
negatively with both of the variables studied, the partial correlation coefficient is smaller than the zero-order
coefficient. For example, $r_{12}$ = +.7019, $r_{13}$ = -.5459, $r_{23}$ = -.5528 and $r_{12.3}$ = +.6605.

2. B. S. Burks, "Statistical Hazards in Nature-Nurture Investigations," Twenty-Seventh Yearbook of the National Society
for the Study of Education, Part I (Bloomington, Illinois: Public School Publishing Company, 192 ), pp. 12-18.
See also:
B. S. Burks, "On the Inadequacy of the Partial and Multiple Correlation Technique," Journal of Educational Psychology,
XVII (November and December, 1926), pp. 532-40, 625-50.

3. If the age range is more than two or three years, the regressions of age on the variables studied are likely to be
significantly curvilinear. Dunlap and Cureton suggest that where this condition is found, chronological age may be
eliminated satisfactorily by partialling out as separate variables chronological age, chronological age squared,
chronological age cubed, and so on. They state that it is usually unnecessary to go beyond the cubes, and in a great
many cases not beyond the squares. J. W. Dunlap and E. E. Cureton, "On the Analysis of Causation," Journal of
Educational Psychology, 21: 657-80, December, 1930. Dunlap and Cureton refer in this connection to the work of Fisher
on time series, which are essentially similar to chronological age.
R. A. Fisher, Statistical Methods for Research Workers (Edinburgh: Oliver and Boyd, 1925), p. 172.

4. Mark A. May, "A Method for Correcting Coefficients of Correlation for Heterogeneity in the Data," Journal of
Educational Psychology, XX (September, 1929), pp. 417-25.







$r_{m_1 m_2}$ is the correlation between the means of the subgroups
when each group is weighted by its number (N).

$\sigma_1$ and $\sigma_2$ are the standard deviations of the entire
population.

$\sigma_{m_1}$ and $\sigma_{m_2}$ are the standard deviations of the
means of the subgroups.

$s_1$ and $s_2$ are the standard deviations of the $\epsilon$'s,
$\epsilon$ being the deviation from the mean of a subgroup.

$r_{\epsilon_1 \epsilon_2}$ is the correlation between the variables
corrected for the effect of heterogeneity.


The computation of this correlation involves: (1) Dividing the data
into homogeneous subgroups, finding the mean of each subgroup, and
recording the number of cases in each, (2) Computing the correlation
between the means of the subgroups and the standard deviation of
each set of means, (3) Finding the correlation between the variables
when all the groups are thrown into one scatter diagram.¹
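
The three computational steps can be sketched in Python as follows. This is a minimal illustration and not May's own procedure: the helper names and the small data set are hypothetical, and standard deviations are taken with N (not N - 1) in the denominator. The sketch also computes the pooled within-subgroup correlation directly from deviations about the subgroup means, which should agree with the value given by the formula under this definition of $\epsilon$.

```python
import math

def w_mean(v, w):
    """Weighted mean of v with weights w."""
    return sum(wi * vi for vi, wi in zip(v, w)) / sum(w)

def w_cov(x, y, w):
    """Weighted covariance (divisor is the sum of the weights)."""
    mx, my = w_mean(x, w), w_mean(y, w)
    return sum(wi * (xi - mx) * (yi - my) for xi, yi, wi in zip(x, y, w)) / sum(w)

def may_corrected_r(groups):
    """groups: list of (xs, ys) pairs, one pair per homogeneous subgroup."""
    xs = [x for gx, _ in groups for x in gx]
    ys = [y for _, gy in groups for y in gy]
    ones = [1] * len(xs)
    # Step 3: correlation and standard deviations for the combined scatter diagram.
    sig1, sig2 = math.sqrt(w_cov(xs, xs, ones)), math.sqrt(w_cov(ys, ys, ones))
    r12 = w_cov(xs, ys, ones) / (sig1 * sig2)
    # Steps 1 and 2: subgroup means, weighted by subgroup size.
    n = [len(gx) for gx, _ in groups]
    m1 = [sum(gx) / len(gx) for gx, _ in groups]
    m2 = [sum(gy) / len(gy) for _, gy in groups]
    sm1, sm2 = math.sqrt(w_cov(m1, m1, n)), math.sqrt(w_cov(m2, m2, n))
    rm = w_cov(m1, m2, n) / (sm1 * sm2)
    # Formula (V).
    num = r12 * sig1 * sig2 - rm * sm1 * sm2
    return num / (math.sqrt(sig1**2 - sm1**2) * math.sqrt(sig2**2 - sm2**2))

def pooled_within_r(groups):
    """Correlation of deviations from subgroup means, pooled over subgroups."""
    sxy = sxx = syy = 0.0
    for gx, gy in groups:
        mx, my = sum(gx) / len(gx), sum(gy) / len(gy)
        sxy += sum((x - mx) * (y - my) for x, y in zip(gx, gy))
        sxx += sum((x - mx) ** 2 for x in gx)
        syy += sum((y - my) ** 2 for y in gy)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores for three subgroups (illustration only).
groups = [
    ([1, 2, 3, 4], [2, 1, 4, 3]),
    ([5, 6, 7], [6, 8, 7]),
    ([9, 10, 11, 12, 13], [10, 12, 11, 13, 15]),
]
print(may_corrected_r(groups), pooled_within_r(groups))
```

Because the subgroups need only be identified, not measured, the same sketch applies when the grouping factor is qualitative.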

In his derivation of the formula, May shows that the Yule formula
for partial correlation is but a special case of the one just
described. May lists several advantages of his formula over the
partial correlation techniques among which are the following:

1. It may be used when heterogeneous factors cannot be measured.
Such an element as race cannot be measured quantitatively and hence
cannot be partialled out by the Yule formula; by formula (V) its
effect on correlation may be determined.

2. The Yule formula makes the assumption that the regressions of
variables 1 and 2 on 3 are linear and that the means of the
subgroups are on these lines. Formula (V) does not make either of
these assumptions.

3. Any number of heterogeneous factors may be handled in one
operation, depending upon how the subgroups are selected. They may
be narrowed down to any degree of homogeneity desired or that the
data will permit.

4. The subgroups may be made up in any way desired.

5. The formula may be turned around and written in two more ways:




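\[ r_{12} = \frac{r_{\epsilon_1 \epsilon_2}\, s_1 s_2 + r_{m_1 m_2}\,\sigma_{m_1} \sigma_{m_2}}{\sqrt{s_1^2 + \sigma_{m_1}^2}\,\sqrt{s_2^2 + \sigma_{m_2}^2}} \tag{Va} \]

\[ r_{m_1 m_2} = \frac{r_{12}\,\sigma_1 \sigma_2 - r_{\epsilon_1 \epsilon_2}\, s_1 s_2}{\sqrt{\sigma_1^2 - s_1^2}\,\sqrt{\sigma_2^2 - s_2^2}} \tag{Vb} \]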


Applying May's formula to the data mentioned above, the junior
author obtained for the total population (grades four to eight
inclusive) the following estimated coefficients from the
coefficients for each of the separate grade groups: .830, .850,
.650, .767, and .778. The calculated value is .75. These findings
suggest caution in applying May's formula.

The value of r for a specified population when r for a non-random
selection from this population is known. When r is known for a
non-random selection from a population, the value of r may, under
certain conditions, be estimated for the desired population. For the
case in which the variability of one set of measures has been
curtailed (restricted) in the selection of the data and the
variability of the other set of measures has been modified only
consequently, Kelley² has developed a formula that is applicable
provided the curtailment is made in such a way that the
rectilinearity of the correlation and the homoscedasticity of the
second variable are maintained. If $\sigma_1$ represents the
standard deviation of the curtailed distribution of the first
variable, $\Sigma_1$ the standard deviation of this variable in the
desired population, and $r_{12}$ and $R_{12}$ the coefficients of
correlation for the corresponding populations, the formula is

\[ R_{12} = \frac{r_{12}\,\dfrac{\Sigma_1}{\sigma_1}}{\sqrt{1 - r_{12}^2 + r_{12}^2 \left(\dfrac{\Sigma_1}{\sigma_1}\right)^2}} \tag{VI} \]


The assumptions upon which this formula is based limit its
application but it shows the effect of selection in one variable
upon the resulting coefficient of correlation. If
$\sigma_1 / \Sigma_1 = .50$ and $r_{12} = .400$, $R_{12} = .658$.
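
The numerical illustration can be reproduced with a short Python sketch of formula (VI). The function name and argument names are hypothetical; `sd_ratio` is taken to be the ratio of the standard deviation in the desired population to that in the curtailed selection, so a curtailed-to-total ratio of .50 corresponds to `sd_ratio = 2`.

```python
import math

def uncurtailed_r(r12, sd_ratio):
    """Estimate R12 for the desired population from r12 in a curtailed selection.

    sd_ratio is (S.D. in desired population) / (S.D. in curtailed selection).
    """
    return (r12 * sd_ratio) / math.sqrt(1 - r12**2 + (r12 * sd_ratio) ** 2)

R12 = uncurtailed_r(0.400, 1 / 0.50)
print(round(R12, 3))
```

When `sd_ratio` is 1 the function returns `r12` unchanged, as the formula requires when no curtailment has occurred.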










1. May, op. cit., p. 422.
2. T. L. Kelley (New York: The Macmillan Company, 1925), pp. 224-25. Pearson developed the same formula earlier.







Pearson¹ has developed a formula that gives the relation between the
value of the coefficient of correlation for the total population and
the corresponding value for the selected population when selection
is in both variables. When the correlation surfaces of both the
selected and the unselected populations are normal, Pearson's
formula may be written as follows:


2 2.22 2 2.2 
(1-Rip ) 172 + 40) 09Rie 





-(1-Ri, =". + 





Ne ° 201 02Rie (VII) 
In this formula R and $\Sigma$ refer to the total population and r
and $\sigma$ to the selected population. If the standard deviations
of the total population, $\Sigma_1$ and $\Sigma_2$, are known the
value of $R_{12}$ can be calculated.

The practical application of this formula is somewhat limited
because of the requirement of normal correlation surfaces for both
populations. For purposes of illustration it was applied to data
obtained by administering two tests to groups of pupils in grades
four to eight inclusive. By means of the above formula the
coefficient of correlation for each grade group was estimated from
the one obtained for the total group and the coefficient of
correlation for the total population was estimated from those for
the separate grade groups. The former are shown in Column 2 of
Table II and the latter in Column 3. With the exception of the sixth
grade, the formula gives rather satisfactory results. The failure of
the formula to work in every instance can be attributed to the fact
that the subgroups are not representative with respect to each other
and the total group. It is apparent that applications of this
formula should be made with caution. Unless the assumptions
underlying its derivation are known to be satisfied, the resulting
coefficient should be considered only a rough estimate.







TABLE II

RESULTS OBTAINED BY THE USE OF FORMULA FOR DOUBLE SELECTION

(1) Grade | (2) r Estimated From Entire Group | (3) Total R Estimated From Grade Groups | (4) Original Coefficients
VIII | | .74 | .32
VII | | .70 | .48
VI | | .35 | .20
V | | .82 | .45
IV | | .82 | .55
Entire Group | -- | | .75

















For the correlation between true measures of intelligence and true
measures of achievement in a population made up of several grade
groups Kelley³ has developed a formula that may be written as
follows:


oe of (1 = my ) 
oi 

In this formula the capital letters refer to the wide-range
population, and the small letters to a single grade group. The
subscript A designates achievement and I intelligence. The
derivation involves the assumptions that $\sigma_a$, $\sigma_i$, and
$r_{ai}$ are the same for the several grade groups entering into the
wide-range population and that the difference between the means of
the successive grade groups is constant.

For coefficients that are measures of the reliability of a test
Kelley⁴ has devised a formula by assuming that the variability
($\sigma_{1.\infty}$) of the variable errors of measurement is the
same for the narrow range of talent as it is for the wide range.
Hence

\[ \sigma \sqrt{1 - r_{1I}} = \Sigma \sqrt{1 - R_{1I}}, \qquad \frac{\sigma}{\Sigma} = \frac{\sqrt{1 - R_{1I}}}{\sqrt{1 - r_{1I}}} \tag{IX} \]

Rar = (VIII) 


(IX) 





1. Karl Pearson, "On the influence of double selection on the variation and correlation of two characters," Biometrika,
VI (1908), pp. 111-12.
"On the general theory of the influence of selection on correlation and variation," Biometrika, VIII (1912), pp.
437-43.

2. This form of Pearson's formula has been derived by James Verne, formerly a student under T. L. Kelley, from the form
given by Pearson. See Bancroft Beatley, Achievement in the Junior High School. Harvard Studies in Education, Vol. 18
(Cambridge: Harvard University Press, 1932), p. 43.

3. T. L. Kelley, Interpretation of Educational Measurements (Yonkers-on-Hudson, New York: World Book Company, 1927),
p. 202.

4. T. L. Kelley, "The Reliability of Test Scores," Journal of Educational Research, III (May, 1921), pp. 370-79.
















Solving for $r_{1I}$ we have

\[ r_{1I} = \frac{\sigma^2 - \Sigma^2 (1 - R_{1I})}{\sigma^2} \tag{X} \]





This formula also involves the assumption that variable errors on
the two testings to determine reliability are uncorrelated.
Holzinger has called attention to the doubtful validity of this
assumption. Hence the formula should be used with caution.
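
A brief Python sketch of formula (X) follows. The function name and the numerical values are hypothetical and chosen only for illustration; the sketch assumes, as the formula does, that the standard deviation of the variable errors is the same in both ranges of talent.

```python
import math

def narrow_range_reliability(R_wide, sd_wide, sd_narrow):
    """Reliability expected in a narrow range, given that in a wide range (formula X)."""
    return (sd_narrow**2 - sd_wide**2 * (1 - R_wide)) / sd_narrow**2

# Hypothetical case: reliability .90 in a group with S.D. 10,
# applied to a single grade with S.D. 5.
r_narrow = narrow_range_reliability(0.90, 10.0, 5.0)
print(round(r_narrow, 2))

# The error S.D. implied in each group should be identical (formula IX).
err_wide = 10.0 * math.sqrt(1 - 0.90)
err_narrow = 5.0 * math.sqrt(1 - r_narrow)
```

In this hypothetical case a test with reliability .90 in the wide range has a reliability of only about .60 within the narrower group, although the errors of measurement are no larger.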


III. THE INTERPRETATION OF r AS AN 
INDEX OF ACCURACY OF PREDICTIONS 


When predictions of values of one variable are made from values of a
correlated variable by means of a regression equation of the form

\[ \bar{X}_1 = r_{12}\,\frac{\sigma_1}{\sigma_2}\,(X_2 - M_2) + M_1 \]

the standard deviation of the errors of estimate is calculated by
means of the formula,

\[ \sigma_{1.2} = \sigma_1 \sqrt{1 - r_{12}^2} \]


The standard error of estimate ($\sigma_{1.2}$) gives the magnitude
of error in the estimates that will be exceeded in approximately
one-third of the cases. The significance of this magnitude may be
arrived at by comparing it with the standard error of estimate when
the predictions are "pure guesses."² In this case $r_{12} = 0$ and
$\sigma_{1.g} = \sigma_1$. (The letter "g" is used as a subscript to
designate that the estimates are "pure guesses.") The ratio of the
obtained standard error of estimate to the standard error of
estimate of pure guesses is easy to interpret, but a more meaningful
expression can be obtained by taking the ratio of the difference of
the two quantities to the standard error of estimate of pure
guesses.


\[ \sigma_{1.g} - \sigma_{1.2} = \sigma_1 - \sigma_1 \sqrt{1 - r_{12}^2} \]

This difference may be thought of as the amount of the reduction in
the standard error of estimate from that for pure guesses. The ratio

\[ \frac{\sigma_1 - \sigma_1 \sqrt{1 - r_{12}^2}}{\sigma_1} \]

is the per cent of reduction, or the improvement in the prediction
that has been accomplished by using the regression equation as a
prediction formula. We may, therefore, write

\[ I_p = 100\,\frac{\sigma_1 - \sigma_1 \sqrt{1 - r_{12}^2}}{\sigma_1} = 100 \left(1 - \sqrt{1 - r_{12}^2}\right) \tag{XI} \]


The symbol, $I_p$, is read, "improvement over pure guesses" or
simply, "improvement over chance."³ The factor, 100, is inserted as
a multiplier to enable us to omit the decimal point when the value
of $\sqrt{1 - r_{12}^2}$ is expressed to only two places. Since the
right-hand member of this formula involves only $r_{12}$ it offers a
means of interpreting a coefficient of correlation in terms of the
accuracy of prediction. By substituting values of $r_{12}$ in the
formula

\[ I_p = 100 \left(1 - \sqrt{1 - r_{12}^2}\right) \]

the corresponding per cents of improvement of the predictions over
pure guesses may be obtained. These per cents are given for several
values of $r_{12}$ in Table III on the following page. For
$r_{12} = .50$, the per cent of improvement is only 13. This
statement means that when r = .50 prediction made by means of the
regression equation will involve a standard error of estimate only
13 per cent smaller than the standard error of estimate of pure
guesses. For $r_{12} = .866$, the per cent of improvement over pure
guess is 50.
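
Formula (XI) is easily evaluated for any coefficient. The sketch below is illustrative only; the function name is hypothetical.

```python
import math

def improvement_over_chance(r12):
    """Per cent reduction in the standard error of estimate, formula (XI)."""
    return 100 * (1 - math.sqrt(1 - r12**2))

for r in (0.50, 0.866, 0.95, 0.99):
    print(f"r = {r:.3f}   improvement = {improvement_over_chance(r):.0f} per cent")
```

The slow rise of the index for moderate values of r, and its rapid rise as r approaches 1.00, is the point of the tabulation.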

In the preceding discussion the predic- 
tions have been judged by comparison with 










1. K. J. Holzinger, Statistical Methods for Students in Education (Boston: Ginn and Company, 1928), pp. 251-54. See
also William Brown and G. H. Thomson (Harvard University Press, 1921), p. 158 f.

2. Predictions made as "pure guesses" must form a distribution having a specified shape. The formula for the standard
error of estimate assumes that this shape is approximately normal. Furthermore, the guesses must be made in a
perfectly random manner.

3. The symbol E is sometimes used and is read "efficiency of prediction."







TABLE III

PER CENT BETTER THAN PURE GUESS OF PREDICTIONS
OF FALLIBLE MEASURES FOR VARIOUS VALUES OF $r_{12}$

$r_{12}$ | Per Cent Better Than Pure Guess
.00 | 0
.50 | 13
.866 | 50
.87 | 51
.88 | 53
.89 | 54
.90 | 56
.91 | 59
.92 | 61
.93 | 63
.94 | 66
.95 | 69
.96 | 72
.97 | 76
.98 | 80
.99 | 86
1.00 | 100




















the obtained measures of the criterion--e.g., predicted marks would
be compared with the marks actually received. We may also compare
the predictions with true measures of the criterion. If $X_1$
represents the fallible measures of the criterion, $X_\infty$ may be
used to designate the true measures of it. The error of estimate
determined by comparing the estimate, $\bar{X}_1$, with the true
measure is $X_\infty - \bar{X}_1$ and the standard error of the
estimates is given by the formula

\[ \sigma_{\infty.2} = \sigma_\infty \sqrt{1 - r_{\infty 2}^2} \]

Since

\[ \sigma_\infty = \sigma_1 \sqrt{r_{1I}} \qquad \text{and} \qquad r_{\infty 2} = \frac{r_{12}}{\sqrt{r_{1I}}} \]

substituting these values in the equation for the standard error of
estimate and simplifying gives

\[ \sigma_{\infty.2} = \sigma_1 \sqrt{r_{1I} - r_{12}^2} \]

We may also write

\[ I_{p\infty} = 100 \left(1 - \sqrt{r_{1I} - r_{12}^2}\right) \tag{XII} \]
It is apparent from this formula that when the reliability of the
measures of the criterion ($r_{1I}$) is low, say .70, the
improvement of the predictions over chance is materially greater
than when the predictions are compared with the fallible measures of
the criterion. If the coefficient of correlation between scores on
an aptitude test and the marks received in a course ($r_{12}$) is
.60, the improvement of the prediction of the marks is

\[ I_p = 100 \left(1 - \sqrt{1 - .60^2}\right) = 20 \]

If the coefficient of reliability of the marks ($r_{1I}$) is .70,

\[ I_{p\infty} = 100 \left(1 - \sqrt{.70 - .60^2}\right) = 100\,(1 - .58) = 42 \]

Hence the efficiency of predicting marks is 20 per cent better than
pure guess, while the efficiency of predicting "estimated true"
marks (the true criterion) is 42 per cent better than pure guess.
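
The comparison of formulas (XI) and (XII) in this example can be verified with the following Python sketch. The function names are hypothetical; `r11` stands for the coefficient of reliability of the criterion, and the sketch assumes that it is not smaller than the square of `r12`.

```python
import math

def improvement_fallible(r12):
    """Formula (XI): improvement when judged against obtained criterion measures."""
    return 100 * (1 - math.sqrt(1 - r12**2))

def improvement_true(r12, r11):
    """Formula (XII): improvement when judged against true criterion measures.

    Assumes r11 >= r12**2, so that the radicand is not negative.
    """
    return 100 * (1 - math.sqrt(r11 - r12**2))

marks = improvement_fallible(0.60)
true_marks = improvement_true(0.60, 0.70)
print(round(marks), round(true_marks))
```

When `r11` is 1.00 the two functions give the same value, since the obtained measures are then themselves true measures.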


IV. THE COEFFICIENT OF CORRELATION AS AN
INDEX OF THE VARIABLE ERRORS OF MEASUREMENT


An interpretation of a coefficient of reliability without
recognition of the range of talent represented in the data from
which it was computed will be misleading if not erroneous. This
consequence can be avoided by employing a measure of the variability
of the differences between the obtained scores ($X_1$) and the
corresponding theoretical true scores ($X_\infty$). The median
deviation of the differences ($X_1 - X_\infty$), which is properly
called the probable error of measurement, is given by the formula²

\[ PE_{1.\infty} = .6745\,\sigma_1 \sqrt{1 - r_{1I}} \tag{XIII} \]

If $\sigma_1$ and $\sigma_I$ are not approximately equal the
following should be used,








1. The symbol $\infty$ designates "infinity." Hence $X_\infty$ designates the mean of an infinite number of measures,
none of which involve a systematic error. Such a mean is called a true measure.

2. The use of this formula implies the assumption that the variable errors of measurement, if it were possible to
isolate them, would form a normal distribution. There is also the assumption that these errors are uncorrelated with











Oo, + O 

PEy.g = 6745 —V1-Fy = (XITIa) 

01 + O01 


For a given population, i.e., when $\sigma_1$ has been determined,
$r_{1I}$ may be interpreted in terms of the corresponding median
deviations of the variable errors of measurement. Unless the value
of $\sigma_1$ is known, it will appear as a factor in this median
deviation. Table IV gives for various values of $r_{1I}$ the
corresponding values of the probable error of measurement.




















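TABLE IV

VALUES OF COEFFICIENT OF RELIABILITY ($r_{1I}$)
AND CORRESPONDING VALUES OF PROBABLE ERROR
OF MEASUREMENT $.6745\,\sigma_1 \sqrt{1 - r_{1I}}$

$r_{1I}$ | $.6745\,\sigma_1 \sqrt{1 - r_{1I}}$ | $r_{1I}$ | $.6745\,\sigma_1 \sqrt{1 - r_{1I}}$
.50 | .48 $\sigma_1$ | .90 | .21 $\sigma_1$
.55 | .45 $\sigma_1$ | .91 | .20 $\sigma_1$
.60 | .43 $\sigma_1$ | .92 | .19 $\sigma_1$
.65 | .40 $\sigma_1$ | .93 | .18 $\sigma_1$
.70 | .37 $\sigma_1$ | .94 | .17 $\sigma_1$
.75 | .34 $\sigma_1$ | .95 | .15 $\sigma_1$
.80 | .30 $\sigma_1$ | .96 | .13 $\sigma_1$
.82 | .29 $\sigma_1$ | .97 | .12 $\sigma_1$
.84 | .27 $\sigma_1$ | .98 | .10 $\sigma_1$
.86 | .25 $\sigma_1$ | .99 | .07 $\sigma_1$
.88 | .23 $\sigma_1$ | 1.00 | .00 $\sigma_1$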
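
The entries of Table IV follow directly from formula (XIII) and may be regenerated with a short Python sketch. The function name is hypothetical; the value returned is the multiplier of the standard deviation of the obtained scores.

```python
import math

def pe_factor(r11):
    """Probable error of measurement as a multiple of the S.D. (formula XIII)."""
    return 0.6745 * math.sqrt(1 - r11)

for r in (0.50, 0.70, 0.80, 0.90, 0.95, 0.99, 1.00):
    print(f"reliability {r:.2f}   probable error = {pe_factor(r):.2f} x S.D.")
```

Multiplying the factor by the standard deviation of a particular group gives the probable error in score units for that group.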
V. THE INTERPRETATION OF r AS A
MEASURE OF RELATIONSHIP


The meaning of relationship. The interpretation of a coefficient of
correlation as a measure of the relationship existing between two
variables is based upon the thesis that the correlation is due to
what is common to them. When $r_{12} \neq 0$, the two variables from
whose paired values $r_{12}$ has been calculated may be analyzed as
follows:

\[ x_1 = c_1 a + b_1 \]
\[ x_2 = c_2 a + b_2 \]


In these equations $c_1$ and $c_2$ are constants; $b_1$ and $b_2$
are variables which are uncorrelated with each other and with $a$;
and $a$ is a variable but for any pair of values of $x_1$ and $x_2$
its magnitude is the same or, if the nature of $x_1$ and $x_2$ is
such that a common element is not reasonable, $a$ designates two
variables that are perfectly correlated.¹ The units of measurement
may be chosen so that either $c_1$ or $c_2$ will be equal to 1.

It is apparent that the magnitude of $r_{12}$ varies with the
relative magnitude of $a$. As $a$ approaches zero, $r_{12}$
approaches zero. When $a$ is relatively large, i.e., when $b_1$ and
$b_2$ are relatively small, the value of $r_{12}$ approaches +1.00.
Hence if a quantitative relationship can be established between
$r_{12}$ and the relative magnitude of $a$, we will have a means of
interpreting the coefficient of correlation in terms of the degree
of relationship between the two variables.

The ratio of $a$ to $x_1$ or $x_2$ is not constant but varies from
measure to measure. Hence neither $a/x_1$ nor $a/x_2$ would be
satisfactory as a description of the existing relationship. The
standard deviation of a given population of measures is constant.
Hence $\sigma_a$, $\sigma_1$ and $\sigma_2$ are constants and the
ratio of $\sigma_a$ to $\sigma_1$ or to $\sigma_2$ might be used as
a quantitative description of the relationship. For





(Footnote Continued)

the scores. The latter assumption is only approximately true. The larger errors appear to be found in the smaller
scores and the smaller errors in the larger scores. This probably means merely that the pupils on the lower levels
of the ability being measured are more erratic in their performances than those on the higher levels. However, the
degree of correlation between the variable errors and the test scores is not high and the formula may be accepted as
approximate.
"... of Errors in Mental Measurement," Journal of Educational Psychology, XIV (May, 1923), pp. 278-88.

1. For proof of this theorem, see:
T. L. Kelley, Crossroads in the Mind of Man (Stanford, California: Stanford University Press, 1928), p. 58. The
statement of Kelley's Proposition 2 is not identical with that given above, but the proof is applicable. (It should
be noted that all variables are expressed as deviations from their respective means.)








reasons that need not be given here it is more satisfactory to use
the variance ratio¹ $\sigma_a^2 / \sigma_1^2$. Since
$x_1 = c_1 a + b_1$ and
$\sigma_1^2 = c_1^2 \sigma_a^2 + \sigma_{b_1}^2$,² the variance
ratio is the per cent of the variance of $x_1$ that is "due" to the
variance of $a$. Hence the relationship between two variables may be
defined as the per cent of the variance of the variable considered
as the dependent one that is "due" to the variance of the common
element.³ (See further consideration of the meaning of the variance
ratio on page 201.)

The relation between the coefficient of correlation and the ratio of
$\sigma_a$ to $\sigma_1$. By means of the following argument it is
demonstrated that

\[ r_{12}^2 \le \frac{\sigma_a^2}{\sigma_1^2} \le r_{12} \]

The maximum value of the variance ratio $\sigma_a^2 / \sigma_1^2$ is
attained when $c_1 = c_2 = 1$ and $\sigma_{b_1} = \sigma_{b_2}$.
Under these conditions

\[ r_{12} = \frac{\sigma_a^2}{\sigma_1^2} \]

The minimum value ($r_{12}^2$) is attained when $c_1 = c_2 = 1$ and
$b_2 = 0$. For the other values of $c_1$, $c_2$, $b_1$ and $b_2$ the
value of the variance ratio lies between $r_{12}$ and $r_{12}^2$.

The proof of these statements is as follows:⁴
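The coefficient of correlation for the variables $x_1$ and $x_2$ may
be written:

\[ r_{12} = \frac{\sum x_1 x_2}{N \sigma_1 \sigma_2} \]

\[ = \frac{\sum (c_1 a + b_1)(c_2 a + b_2)}{N \sigma_1 \sigma_2} \]

\[ = \frac{\sum (c_1 c_2 a^2 + c_1 a b_2 + c_2 a b_1 + b_1 b_2)}{N \sigma_1 \sigma_2} \]

\[ = \frac{c_1 c_2 \sum a^2 + c_1 \sum a b_2 + c_2 \sum a b_1 + \sum b_1 b_2}{N \sigma_1 \sigma_2} \]

Since it was postulated that $b_1$ and $b_2$ are uncorrelated with
each other and with $a$, the summations of products involving $b_1$
or $b_2$ are equal to zero. Hence,

\[ r_{12} = \frac{c_1 c_2 \sum a^2}{N \sigma_1 \sigma_2} \]

Since $\frac{\sum a^2}{N} = \sigma_a^2$ this equation simplifies to:

\[ r_{12} = \frac{c_1 c_2 \sigma_a^2}{\sigma_1 \sigma_2} \tag{XIV} \]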
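
Equation XIV can be checked numerically. The sketch below is an illustration under stated assumptions: the three short series are hypothetical, constructed so that each has a mean of zero and so that $a$, $b_1$ and $b_2$ are exactly uncorrelated, and the constants $c_1$ and $c_2$ are arbitrary.

```python
import math

# Hypothetical deviation scores: each series sums to zero and the three
# series are mutually orthogonal, i.e., exactly uncorrelated.
a  = [1, 1, -1, -1, 1, 1, -1, -1]
b1 = [1, -1, 1, -1, 1, -1, 1, -1]
b2 = [1, 1, 1, 1, -1, -1, -1, -1]

def sd(v):
    """Standard deviation of a series already expressed as deviations."""
    return math.sqrt(sum(t * t for t in v) / len(v))

def corr(x, y):
    """Product-moment correlation of two series of deviations."""
    return sum(p * q for p, q in zip(x, y)) / (len(x) * sd(x) * sd(y))

c1, c2 = 2.0, 0.5
x1 = [c1 * ai + bi for ai, bi in zip(a, b1)]
x2 = [c2 * ai + bi for ai, bi in zip(a, b2)]

r12 = corr(x1, x2)                                    # computed directly
r12_xiv = c1 * c2 * sd(a) ** 2 / (sd(x1) * sd(x2))    # Equation XIV
print(r12, r12_xiv)
```

With these particular values both expressions equal .40.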





1. The square of a standard deviation of a distribution is called its variance. The ratio $\sigma_a^2 / \sigma_2^2$
may be used, but when the relationship is functional, $x_1$ is generally thought of as the dependent variable and in
such cases the ratio $\sigma_a^2 / \sigma_1^2$ is the appropriate one to use.

2. The proof that $\sigma_1^2 = c_1^2 \sigma_a^2 + \sigma_{b_1}^2$ is as follows:

\[ \sigma_1^2 = \frac{\sum x_1^2}{N} = \frac{\sum (c_1 a + b_1)^2}{N} = \frac{c_1^2 \sum a^2 + 2 c_1 \sum a b_1 + \sum b_1^2}{N} \]

Since $a$ and $b_1$ are uncorrelated the product $\sum c_1 a b_1$ is equal to zero, hence

\[ \sigma_1^2 = \frac{c_1^2 \sum a^2}{N} + \frac{\sum b_1^2}{N} = c_1^2 \sigma_a^2 + \sigma_{b_1}^2 \]

3. A somewhat different interpretation of relationship may be made by assuming that a trait is made up of a number of
equally important and equally probable elements. Then the coefficient of correlation between two sets of paired
measures may be defined in terms of the per cent of elements common to the two traits. The interpretation of a
coefficient suggested by this concept is not the same as that given here. The validity of the basic assumption is not
certain and hence it is omitted except for this brief mention. For an account see P. H. Nygaard, "Interpretation of
Correlation on Basis of Common Elements," Journal of Educational Psychology, XXIII (November, 1932), pp. 576-85.

4. The reader who is not interested in this proof may omit to the next paragraph heading.




















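The postulates in regard to $a$, $b_1$, $b_2$, $c_1$ and $c_2$
enable us to write¹

\[ \sigma_1^2 = c_1^2 \sigma_a^2 + \sigma_{b_1}^2 \]
\[ \sigma_2^2 = c_2^2 \sigma_a^2 + \sigma_{b_2}^2 \]

Substituting for $\sigma_1$ and $\sigma_2$ in Equation XIV we have

\[ r_{12} = \frac{c_1 \sigma_a}{\sqrt{c_1^2 \sigma_a^2 + \sigma_{b_1}^2}} \cdot \frac{c_2 \sigma_a}{\sqrt{c_2^2 \sigma_a^2 + \sigma_{b_2}^2}} \]

Squaring both sides:

\[ r_{12}^2 = \frac{c_1^2 \sigma_a^2}{c_1^2 \sigma_a^2 + \sigma_{b_1}^2} \cdot \frac{c_2^2 \sigma_a^2}{c_2^2 \sigma_a^2 + \sigma_{b_2}^2} \tag{XV} \]

\[ = \frac{\sigma_a^2}{\sigma_a^2 + \dfrac{\sigma_{b_1}^2}{c_1^2}} \cdot \frac{\sigma_a^2}{\sigma_a^2 + \dfrac{\sigma_{b_2}^2}{c_2^2}} \tag{XVa} \]

There are two conditions under which the right-hand member of this
equation can be simplified. If $b_2 = 0$ and $c_1 = c_2 = 1$² so
that

\[ x_1 = a + b_1 \]
\[ x_2 = a \]

then the second fraction of the right-hand member of Equation XV
becomes equal to unity and

\[ r_{12}^2 = \frac{\sigma_a^2}{\sigma_a^2 + \sigma_{b_1}^2} = \frac{\sigma_a^2}{\sigma_1^2} \tag{XVI} \]

If $x_1$ is known to be a function of $x_2$, this case may be
described by saying that $x_2$ contributes itself completely to
$x_1$.

If $\sigma_{b_1} = \sigma_{b_2}$ and $c_1 = c_2 = 1$

\[ r_{12}^2 = \frac{\sigma_a^2}{\sigma_a^2 + \sigma_{b_1}^2} \cdot \frac{\sigma_a^2}{\sigma_a^2 + \sigma_{b_1}^2} \]

and

\[ r_{12} = \frac{\sigma_a^2}{\sigma_a^2 + \sigma_{b_1}^2} = \frac{\sigma_a^2}{\sigma_1^2} \tag{XVII} \]

The fraction $\frac{\sigma_a^2}{\sigma_a^2 + \sigma_{b_1}^2}$ or its
equivalent $\frac{\sigma_a^2}{\sigma_1^2}$ is the ratio of the
variance of $a$, the element common to both $x_1$ and $x_2$, to the
variance of $x_1$, and may be interpreted as the per cent of the
variance of $x_1$ that is associated with the variance of the common
element. In the first case, this ratio is given by $r_{12}^2$, and
in the second by $r_{12}$.

Referring to Equation XV it is apparent that if
$\sigma_{b_2}^2 > 0$, the second fraction of the right-hand member
is < 1. Hence

\[ r_{12}^2 < \frac{c_1^2 \sigma_a^2}{c_1^2 \sigma_a^2 + \sigma_{b_1}^2} = \frac{c_1^2 \sigma_a^2}{\sigma_1^2} \tag{XVIII} \]

When $\sigma_{b_1} \neq \sigma_{b_2}$, one fraction in the
right-hand member of Equation XVa will be less than the other and
the replacement of the larger by the smaller will give a product
less than $r_{12}^2$. Assume the first is smaller than the second.³
Then

\[ r_{12}^2 > \left( \frac{c_1^2 \sigma_a^2}{c_1^2 \sigma_a^2 + \sigma_{b_1}^2} \right)^2 \]

\[ r_{12} > \frac{c_1^2 \sigma_a^2}{c_1^2 \sigma_a^2 + \sigma_{b_1}^2} = \frac{c_1^2 \sigma_a^2}{\sigma_1^2} \tag{XIX} \]

Since the units of measurement may be chosen so that $c_1 = 1$, we
may write

\[ r_{12} > \frac{\sigma_a^2}{\sigma_1^2} > r_{12}^2 \]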




1. For proof see footnote 2, page 199.

2. If $b_2 = 0$, the units of measurement can be chosen so that $c_1 = c_2 = 1$. It should be noted that variable errors of measurement are included in the b's. Hence unless $r_{12}$ has been corrected for attenuation, $b_2$ will not be equal to zero.

3. This assumption involves no restriction in the proof.





March, 1933 


The meaning of the variance ratio. The meaning of the ratio $\sigma_a^2/\sigma_1^2$ depends upon the type of relation that exists between the two variables $x_1$ and $x_2$. If they are functionally related, i.e., if $x_2$ is a cause of $x_1$, this ratio gives the per cent of the variance of $x_1$ that is due to the variance of $x_2$. If the two variables are not functionally related, the ratio is merely the ratio of the variance of $a$ ($\sigma_a^2$) to the variance of $x_1$. Since the fact that $r_{12} \neq 0$ is not evidence of the existence of a functional relationship, the first of these meanings is appropriate only when a functional relationship has been demonstrated by other means or the assumption of such a relationship is made explicit.

It should be noted that the variance is a
function of the population and hence that
the interpretation must be restricted to the
population for which $r_{12}$ is the coefficient
of correlation. It has been proposed by
McCall, Kelley, and others that in the case
of educational measurements an unselected
group of twelve-year-old children be used
as a standard population. Kelley has presented
data in support of the thesis that a
population made up of equal unselected 
groups from six consecutive grades is ap- 
proximately equal to that of an unselected 
age group. In order to estimate the coefficient
for a standard population by means of
formula VIII² it is necessary to know the ratio
of the standard deviations in the two populations.
Kelley has estimated the values of this ra- 
tio for the Stanford Achievement Test and an 
intelligence test. When such a table of 
values is available the coefficient of cor- 
relation may be estimated for the standard 
population. Employing this procedure Kelley⁴ found that for a complete age population the value of $\sigma_a^2/\sigma_1^2$ is approximately .90 when $x_1$ represents scores yielded by a battery of achievement tests and $x_2$ represents the scores on a verbal group intelligence test. He interprets this finding as the per cent of community of function of such tests. Literally it means that the ratio of the variance of the common element to the variance of the obtained scores is .90 in the case of this population.
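
The relations between the coefficient and the variance ratio can be checked numerically. The following is a minimal sketch, assuming the common-element model of the preceding derivation (each variable is the common element plus an uncorrelated specific element); the function names and the variances used are hypothetical and chosen only for illustration.

```python
import math


def correlation_from_common_element(var_a, var_b1, var_b2, c1=1.0, c2=1.0):
    """r12 for x1 = c1*a + b1 and x2 = c2*a + b2, with a, b1, b2 uncorrelated."""
    var_1 = c1 ** 2 * var_a + var_b1
    var_2 = c2 ** 2 * var_a + var_b2
    return c1 * c2 * var_a / math.sqrt(var_1 * var_2)


def variance_ratio(var_a, var_b1, c1=1.0):
    """Share of the variance of x1 that is carried by the common element."""
    common = c1 ** 2 * var_a
    return common / (common + var_b1)


# Hypothetical variances, chosen only for illustration.
VAR_A, VAR_B1 = 9.0, 16.0
ratio = variance_ratio(VAR_A, VAR_B1)

# First case: b2 = 0, so x2 is the common element itself; r12 squared gives the ratio.
r_case1 = correlation_from_common_element(VAR_A, VAR_B1, 0.0)

# Second case: equal specific variances; r12 itself gives the ratio.
r_case2 = correlation_from_common_element(VAR_A, VAR_B1, VAR_B1)

# General case with the first fraction the smaller: r12 > ratio > r12 squared.
r_general = correlation_from_common_element(VAR_A, VAR_B1, 4.0)

print(round(ratio, 4), round(r_case1 ** 2, 4), round(r_case2, 4), round(r_general, 4))
```

With these made-up variances the variance ratio lies between the general-case coefficient and its square, which is the bracketing stated at the end of the derivation.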


VI. APPLICATIONS 


The conclusions reached in the preceding sections are matters of prime importance for the investigator who employs correlation analysis in studying his data. This statistical technique is one of the most valuable ones at the command of the research worker, but its value in a particular case is dependent upon the interpretation given to the calculated coefficient. Little has been accomplished until the meaning of the coefficient has been ascertained.

A basic general principle is that the in- 
terpretation of the coefficient of correla- 
tion must be made with reference to a speci- 
fied population. If the population for which 
the interpretation is desired is not the 
same as that from which the coefficient of 
correlation has been calculated, estimating 
the value of r for the desired population is 
a preliminary phase of the interpretation.
Formulae are available for making such estimates,
but they yield accurate results only
under certain conditions. Hence they should
be employed with caution, especially by the
investigator who does not thoroughly understand
correlation analysis. In general, estimated
coefficients should not be thought
of as precise determinations.

Correlation analysis is used for two purposes: (1) as a means of ascertaining whether within a specified population two paired traits or phenomena are related, and (2) as a means of measuring the degree of relationship that does exist. It should be noted that correlation analysis does not reveal the nature of the relationship, whether it is one of cause and effect or merely one of concomitant variation. The determination of the nature of the relationship must be sought by other means.





1. T. L. Kelley, Interpretation of Educational Measurements (Yonkers-on-Hudson, New York: World Book Company, 1927), p. 197 f.

2. See page 195.

3. Ibid., p. 202.

4. Ibid., p. 208.










A value of r not equal to 0.00 is evidence of the existence of a relationship between the two sets of paired measures within the population to which the coefficient applies. Of course a very small value of r such as .01 or .05 indicates only a very slight degree of relationship, and, if the number of cases is not large and the assumptions underlying the calculation of the coefficient are not fully satisfied, this interpretation should be made with caution. Frequently the question relates to the existence of a relationship in the universe or infinitely large population of which the data from which r has been calculated is a sample. When the sample is large and random the formula for the probable error of a coefficient of correlation (see page 189) may be used to calculate the probable limits of the coefficient of correlation for the universe. The use of the usual formula when the sample is not large, i.e., when N is less than 100, or when the sample is not known to be approximately random, is not sound. When N is less than 50 or the sample is not approximately random, the determination of the statistical significance of the coefficient of correlation by the usual procedure is likely to be in error. Many of the determinations of statistical significance in our educational literature are erroneous.

When the value of r is desired for a population homogeneous with respect to one or more specified factors, or when it is desired for a population of which the data represent a non-random sample, partial correlation and the other statistical techniques should be employed with caution. All of this group of formulae are based upon certain assumptions, and if these assumed conditions are not approximated, the estimated coefficient should be regarded as only a rough determination. In general, inexperienced workers should not employ partial correlation and the other formulae given on pages 192-197. If they do, they should keep in mind the uncertainty of their findings.

When the coefficient of correlation has been calculated or estimated for the specified population, the procedure to be employed in interpreting r in terms of the degree of relationship depends upon the nature of the question whose answer is desired. If the question relates to the value of one variable as a means of predicting corresponding measures of the other, the investigator should calculate the standard or probable error of estimate or the per cent of improvement over chance prediction or pure guess (see page 196, and Table III, page 197). The mere statement of the coefficient of correlation, e.g., $r_{12} = .60$, will be misleading to the uninformed reader. He probably has read that a coefficient of .60 is to be regarded as "high" and consequently will interpret it as meaning that fairly accurate predictions can be made, when as a matter of fact they are only 20 per cent better than pure guesses.

When the coefficient of correlation is a 
measure of the reliability of the scores 
yielded by an educational test the interpre- 
tation should be in terms of the probable 
error of measurement. It is a well-known
fact that a coefficient of reliability is materially
influenced by the range of talent
represented by the data from which it is 
calculated. Hence a coefficient of reliabil- 
ity has little meaning apart from the range 
of talent. On the other hand the probable 
error of measurement is not materially in- 
fluenced by the range of talent. In fact, it 
is generally considered to be independent of 
the range of talent. 

When the question of the degree of relationship does not relate to either the accuracy of predictions or the variable errors in test scores, the variance ratio may be used as a means of interpreting the coefficient of correlation. This ratio represents a concept that is new in the field of educational research and the investigator should make certain that he understands its meaning when he attempts to use it. It is the ratio of the variance (standard deviation squared) of the factor common to the two variables to the variance of the variable taken as the dependent one. Suppose for a given population $r_{12} = \sigma_a^2/\sigma_1^2 = .80$. What does this mean? In the case of intelligence test scores and achievement test scores Kelley calls it a measure of the "community of function" or the "per




1. T. L. Kelley, Interpretation of Educational Measurements (Yonkers-on-Hudson: World Book Company, 1927), p. 202 f.





cent of intelligence that is achievement and the per cent of achievement that is intelligence."¹ This meaning, however, is not literally true. If the variance ratio is .80 it does not follow that 80 per cent of a pupil's scores on an intelligence test and an achievement test consist of the same thing or that 80 per cent of a pupil's achievement is really intelligence. In fact the ratio of the common element to the total score is certain to vary from pupil to pupil. What a variance ratio of .80 does mean is that in the case of a given population, if the factor common to intelligence test scores and achievement test scores could be separated out and assembled in a frequency distribution, the ratio of the square of the standard deviation of this distribution to the square of the standard deviation of the distribution of scores would be .80. In other words the variance ratio is a function of the distribution of the common element and of the distribution of the dependent variable.

It should be noted that the value of the 
variance ratio is influenced by the popula- 
tion. Hence the interpretation must always 
be made with reference to a particular population.
Other things remaining the same, the
greater the range of the population the
greater the value of the ratio. Hence vari- 
ance ratios from dissimilar populations 
should not be compared and it is desirable 
that all variance ratios should be reduced to 
the basis of a standard population such as a 
complete age group. 

The magnitude of a coefficient is influenced by the character of the population from which it is calculated. Even when populations bear the same general designation, as in the case of grade groups, there may be significant variations, especially if the populations are not large. Hence any interpretation based upon a comparison of coefficients should be made only after the investigator has inquired carefully into the character of the populations involved. The use of such terms as "low," "high," and "very high" implies comparisons, and the difficulty of making meaningful comparisons suggests that the use of such terms should be avoided.




The preceding suggestions are applicable to the readers of reports of research as well as to those engaged in carrying on research. If the author fails to make clear the population to which a coefficient of correlation applies, the reader should seek to identify that population. The statement that the coefficient of correlation between reading test scores and arithmetic test scores is .65 is indefinite if not meaningless. It can have a meaning only when referred to a particular population. The reader should keep in mind that the usual determination of statistical significance is valid only in the case of large random samples. Hence, many determinations of statistical significance need to be discounted, some of them rather heavily. In any case the meaning of statistical significance should be kept in mind.

When a coefficient of correlation is given as a measure of the accuracy of prediction the reader should translate the value of r into the per cent of improvement over pure guesses. In this connection it is helpful to remember that a coefficient of .60 represents an improvement of 20 per cent over pure guesses and that an improvement of 50 per cent is obtained when the coefficient of correlation is approximately .86.
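
The two landmarks just quoted can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming that the per cent of improvement over a pure guess is taken as 1 - sqrt(1 - r^2), the usual index based on the error of estimate; the article's own statement of the formula lies on an earlier page, so the function name and the exact form used here are assumptions.

```python
import math


def improvement_over_guess(r):
    """Proportional reduction in the error of estimate: 1 - sqrt(1 - r^2)."""
    return 1.0 - math.sqrt(1.0 - r * r)


for r in (0.30, 0.60, 0.86, 0.95):
    print(f"r = {r:.2f}: improvement over a pure guess = {improvement_over_guess(r):.0%}")
```

Under this assumption the improvement grows slowly at first and rapidly only as r approaches unity, which is why a coefficient commonly described as "high" may still represent a modest gain in predictive accuracy.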

Since a coefficient of reliability is a rather meaningless measure of the accuracy of test scores, the reader should seek for the probable error of measurement. If the author has not given it, the reader should endeavor to calculate it. When this is not possible the reader should consider the coefficient of reliability in relation to the range of talent in the population from which it has been calculated. Other things being equal, the greater the range of talent the lower the degree of accuracy indicated by the coefficient of reliability of a given magnitude. In other words the coefficient of reliability of .85 obtained from a single grade group designates rather accurate test scores, but if it has been derived from a group consisting of a sequence of four or five grades it designates scores involving rather large variable errors of measurement.
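
The dependence of this interpretation on the range of talent can be illustrated as follows. This is a sketch only, assuming the customary formula for the probable error of measurement, .6745 times the standard deviation times sqrt(1 - r); the two standard deviations are hypothetical figures standing for a single grade and for a span of several grades.

```python
import math

PROBABLE_ERROR_FACTOR = 0.6745  # converts a standard error to a probable error


def probable_error_of_measurement(sd, reliability):
    """Probable error of a score, given the group's S.D. and reliability coefficient."""
    return PROBABLE_ERROR_FACTOR * sd * math.sqrt(1.0 - reliability)


# Hypothetical standard deviations: one grade versus a sequence of grades.
single_grade = probable_error_of_measurement(sd=8.0, reliability=0.85)
several_grades = probable_error_of_measurement(sd=16.0, reliability=0.85)

print(f"single grade:   P.E. of measurement = {single_grade:.2f} score points")
print(f"several grades: P.E. of measurement = {several_grades:.2f} score points")
```

With the reliability held at .85, doubling the standard deviation doubles the probable error of measurement, so the same coefficient describes much less accurate scores in the wider group.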





JOURNAL OF EXPERIMENTAL EDUCATION Volume I, No. 3


ON THE ACCURACY WITH WHICH RELIABILITY MAY 
BE MEASURED BY CORRELATING TEST HALVES 
William A. Brownell 
Duke University 





Durham, North Carolina 


The three following statements are taken from as many texts¹ on educational measurement which have appeared since 1928. All three quotations bear upon the same problem, namely, the correlation of test-halves as a means of measuring test-reliability. One text says:

"There are three principal methods of obtaining a coefficient of reliability. The first of these is....; the second, ....; and the third method is to compare the chance halves of the scores of each pupil on a single application of the test...."

"Without any further knowledge of statistics, a coefficient may be computed from this model.... In using the chance-halves method, the sum of the scores earned on the even items for each test may be entered in column 2, the sum of the odd items for each test in column 3, and the remainder of the computation done as indicated...."

Another text states:

"Another method, used when there is only one form of the test, is to compute separately the scores obtained on the odd-numbered items and the scores on the even-numbered items. Thus the test is divided into halves, the scores on which may be correlated. By a comparatively simple formula the correlation between these halves is then corrected for the length of the test in order to determine what the correlation for the entire test would be."

And the third text says:

"A second method of obtaining this measure is correlating the scores of the same pupils made on even items of a single test with the scores made on the odd items of the same test. It is necessary to carry out certain statistical procedures of correction if the second method is used."


These quotations, it should be noted, are 
complete. That is to say, they represent 
all the reader will find in these texts con- 
cerning the method of correlating test- 
halves, except insofar as the authors mention 
certain general cautions which apply to all 
types of reliability coefficients. No word 
of warning is uttered with specific reference 
to this one method of estimating reliability: 
none of the assumptions underlying the method 
and none of the special difficulties encountered
in its use are pointed out. The reader,
in the absence of such information, is
clearly justified in employing the method
practically without qualification and with
full confidence in its adequacy.

In view of the uncritical manner in which
the split-halves method of determining test
reliability is set forth in these texts, as
well as in others which might be cited,³ it
may not be inappropriate to present certain





1. The titles and page references have been omitted for obvious reasons: the argument of this paper is wholly impersonal and, as such, is not directed against particular authors or particular textual treatments.


2. Reference here is, of course, to the use of the Spearman-Brown Formula, the use of which is explained later in this text, as it is also in the case of the two other texts which have been quoted. The general formula is written

$$r_{11} = \frac{2\,r_{\frac{1}{2}\frac{1}{2}}}{1 + r_{\frac{1}{2}\frac{1}{2}}}$$

As it applies to stepping-up coefficients obtained by correlating test-halves, $r_{11}$ refers to the desired coefficient for the test as a whole, and $r_{\frac{1}{2}\frac{1}{2}}$ refers to the coefficient for the two halves.


3. In order to determine the degree to which the three above quotations could be regarded as typical, an examination
was made of the treatment accorded this topic in a total of fifteen texts in educational measurement (listed on the 
following page), only two of which were published before 1927. On the basis of their treatment of the correlation 







empirical data which illustrate some of the
dangers involved in correlating test-halves. 
In the sections which follow, a report is 
given of an attempt to measure the reliabil- 
ity of a series of short objective tests by 
the split-halves technique. As perhaps the 
simplest and the most direct form of exposi- 
tion, the steps in the study are outlined in 
the order of their occurrence, and in con- 
nection with each step the data secured are 
summarized. 


ORIGIN OF THE PROBLEM 


The original test employed in this investigation (designated as Test I, Form A) consisted of 32 true and false statements and was based on the seven-page treatment of the Recapitulation Theory in Hollingworth's Mental Growth and Decline.¹ The reliability of the test was determined by correlating scores on odds with scores on evens for 55 college students of sophomore and junior grade. The reliability coefficient² was found to be .41. When given to the same students ten days later to measure retention, the test was again found to have a reliability coefficient of .41 by the split-halves method.

Inasmuch as the test was to be used again, an attempt was made to increase its reliability. Accordingly, three types of changes, frequently recommended for this purpose, were made: (1) four new questions were added; (2) twelve of the original questions were reworded to remove ambiguity; and (3) the order of questions was altered to approximate somewhat more closely a scale. These revisions, it was anticipated, should raise the reliability coefficient several points.

The new 36-item test (designated as Test 
I, Form B) was given to 112 different stu- 
dents under conditions similar to those that 
prevailed when Form A was given. The relia- 
bility coefficient proved to be .35, actual- 
ly lower instead of higher than the original 





(Footnote continued)

of test-halves, these fifteen texts fall into three rather distinct groups. (1) In six texts the topic of reliability is barely mentioned, and the correlation of test-halves is mentioned not at all. (Yet, curiously enough, the authors of these texts give every evidence of expecting teachers to be intelligent in selecting standard tests and to be efficient in constructing their own informal tests.) (2) In another group of six texts reliability is discussed at some length as a desirable characteristic of a test, and the method of measuring reliability by correlating test-halves is illustrated. It is to this group of texts that the three cited above belong. In the other three texts in this group the treatment of this method of measuring reliability is as uncritical as the discussions quoted. The only limitations recognized as applying to the reliability coefficient obtained from test-halves are those which apply to other types of reliability coefficient also. (3) In the three remaining texts varying degrees of emphasis are given to the fact that before test-halves may properly be correlated, there must be assurance that the test-halves are "comparable," or "random selections for the whole test," or "approximately equal." One of the three texts emphasizes the "contingency" of all reliability coefficients; another specifically states that the method here under discussion "cannot be universally applied." Inadequate as are these three last treatments, none of which undertakes to explain what is meant by "comparability" and "equality" and "equivalence," nor shows how these characteristics are to be recognized and secured, they are the most satisfactory to be found among the fifteen texts. Incidentally, it may be stated that the most complete treatment is one of the earliest.


The texts which were examined are the following: 


A. R. Gilliland, R. H. Jordan, and Frank S. Freeman, Educational Measurements and the Classroom Teacher (New York: 





The Century Company, 1951). 


Harry A. Greene and Albert N. Jorgensen, The Use and Interpretation of Educational Tests (New York: 


tation of Educational Measurements (Yonkers—on-Hudson: 
s in Written Examinations (Boston: 


and Company, 1929). 
Truman Lee Kelley, In 
Albert R. Lang, ern Me 
I. N. Madsen, Educati 
Walter S. Monroe, An Introduction to 
C. W. Odell, Educational Measurement in 


at emen 





Charles Russell, (New York: 
Henry L. Smith and Wendell W. Wright, Tests and 


of Educational Measurements (Boston: 


Longmans, Green 


World Book Company, 1927). 

Houghton Mifflin Company, 1950). 

World Book Company, 1950). 
Houghton Mifflin Company, 1923). 

The Century Company, 1950). 

(Yonkers-on-Hudson: 


(Yonkers-on-Hudson: 


World Book 


pn Foresman Company, 1929). 


1 Instruction (Yonkers-on-Hudson: World Book 


Ginn and Company, 1950). 


(Newark, N.J.: Silver, Burdett and Company, 1928). 


Percival M. Symonds, t in Education (New York: The Macmillan Company, 1927). 
Ernest W. Tiegs, Boston: Houghton Mifflin Company, 1981). 
Guy M. Wilson and Kremer J. Hoke, How to wensure revised, New York: The Macmillan Company, 1950). 


1. H. L. Hollingworth, Mental Growth and Decline (New York: D. Appleton, 1927), pp. 206-15.

2. This coefficient, as well as other coefficients to be reported as "reliability coefficients" for their respective tests, represents the result of correlating test-halves without stepping up by the use of the Spearman-Brown formula. The fact that the coefficients are not stepped up explains their small size. The use of the term "reliability coefficients" to describe these measures may properly be criticized on the ground that they are not such unless the assumptions underlying the Spearman-Brown formula are met. The term will, however, be used in the interest of convenient expression.


coefficient. When used later to measure retention,
the test showed a reliability of
.32. With two other groups of students the
corresponding coefficients were .32 and .26.


THE INVESTIGATION PROPER 


Further use of Form B. Mention has already been made of the fact that in deriving Form B the order of questions in Form A was altered. As a result, the two halves of Form B (odds vs. evens) were differently constituted from the corresponding halves for Form A. It seemed possible that there might be some relation between the composition of the test-halves and the changes in the reliability coefficients.¹ To test this hypothesis new test-halves were artificially prepared by distributing the 36 questions in various ways. Scores on these halves were then calculated from the papers of the 58 students who had taken Form B as originally constituted. And finally, the reliability coefficients were computed on the basis of these scores on the artificial halves.

The data relevant to this part of the
study are summarized in Table I. As shown
in the last column, the coefficients range
from .20 (row 7) to .48 (row 6).
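
Because these coefficients are correlations between half-tests, they are smaller than the reliability that would be claimed for the whole test. A minimal sketch of the step-up follows, assuming the Spearman-Brown formula in its usual form; the function name and the `factor` parameter are illustrative, and the three coefficients are those quoted in the text for Form B.

```python
def step_up(r_half, factor=2.0):
    """Spearman-Brown estimate for a test `factor` times as long as the part correlated."""
    return factor * r_half / (1.0 + (factor - 1.0) * r_half)


# Half-test coefficients quoted in the text: the lowest, the odds-vs-evens, the highest.
half_coefficients = [0.20, 0.32, 0.48]
stepped = [round(step_up(r), 2) for r in half_coefficients]

for r_half, r_whole in zip(half_coefficients, stepped):
    print(f"half-test r = {r_half:.2f}  ->  stepped-up r = {r_whole:.2f}")
```

Stepping up enlarges every coefficient but does not bring them together: the same test, given to the same students, still yields widely different values according to the way the halves happen to be composed.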

Four facts should be emphasized at this 
point. First, it should be borne in mind 
that the only change made was in the organ- 
ization of the test-halves. Subject matter 
remained constant, as did also the reactions 
of the students. Second, all these coeffi- 
cients are supposed to measure the same 
thing, namely, the reliability of Form B. 
That is to say, they might be expected to 
measure the same thing if one knew no more 
about the split-halves method of determining 
reliability than one is able to learn from a 
study of twelve of the fifteen texts on 
measurement which have been cited. But it 
is hardly conceivable that the same test,
given to the same students, can have a reliability
such that it is equally well expressed
by coefficients of .20 and .48.

Third, it is impossible to explain away 
the discrepancy between .20 and .48 on the 











TABLE I

DATA ON TEST I, FORM B; 36 TRUE AND FALSE STATEMENTS, 58 CASES

| Nature of halves | Composition of halves as shown by distribution of questions | r* |
|---|---|---|
| 1. Odds vs. evens | 1,3,5,7, etc., vs. 2,4,6,8, etc. | .32 |
| 2. Halves equalized for difficulty | 1,4,5,6,8,10,12,13,14,16,19,20,22,29,30,31,32, and 36 vs. 2,3,7,9,11,15,17,18,21,23,24,25,26,27,28,33,34, and 35 | .29 |
| 3. Hard half vs. easy half | 1,3,7,8,9,10,12,13,15,21,22,26,27,29,30,31,32, and 36 vs. 4,5,6,11,14,16,17,18,19,20,23,24,25,28,33 [?],34, and 35 | .34 |
| 4. Same as 1 above, except two pairs of items changed | Odds vs. evens except that questions 15 and 16, and questions 33 [?] and 34 were interchanged | .22 |
| 5. Same as 1 above, except three pairs of items changed | Odds vs. evens except that questions 1 and 2, 4 and 5, and 18 and 19 were interchanged | .35 |
| 6. Same as 1 above, except four pairs of items changed | Odds vs. evens except that questions 1 and 2, 4 and 5, 17 and 18, and 21 and 22 were interchanged | .48 |
| 7. Chance halves, items drawn from a hat | 1,2,3,4,5,6,10,17,19,23,26,27,28,30,33,34,35, and 36 vs. 7,8,9,11,12,13,14,15,16,18,20,21,22,24,25,29,31, and 32 | .20 |
| 8. Chance halves, items drawn from a hat | 1,2,3,4,5,6,7,8,10,16,18,19,22,25,26,27,32,35, and 36 vs. 9 [?],11 [?],12,13,14,15,17,20 [?],21 [?],23 [?],24,28,29,30,31,33 [?], and 34 | .30 |
| Range | | .20 to .48 |

*The r's in this column (and in the corresponding columns for Tables II to V inclusive) have not been "stepped-up." P.E.'s for the r's range from .068 (for .48) to .087 (for .20).

basis of chance in sampling. It is true that the gross difference of 28 points is only slightly more than three times the P.E. of .20, but such an argument, even though it is based on an accepted means of testing the reliability of differences in coefficients, simply does not apply here. For P.E. of r











1. The decreases in the coefficients were not due to a lessening of the variability of the scores in the different groups of students. The S.D.'s of the different distributions agreed fairly closely.

2. Again attention is called to the fact that these coefficients do not represent the reliability of the test as a whole. Corrected by the Spearman-Brown formula the range would be from .33 to .65, the latter being a rather respectable reliability coefficient for a 36-item true-false test.


is designed to correct for population errors, for errors in sampling. The point at issue here is not a problem of sampling and population; it has nothing to do with the contingency of the coefficient upon the nature of the particular group tested; it is not a question of deciding how much these coefficients might be expected to fluctuate if other population groups were used. The same group of subjects provided data for all the coefficients in Table I.
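
For the reader who wishes to retrace the arithmetic behind the "three times the P.E." remark, a short sketch follows. It assumes the customary formula for the probable error of a coefficient of correlation, .6745(1 - r^2)/sqrt(N); the formula itself is given on an earlier page of this journal and is not restated in this article, so its use here is an assumption.

```python
import math


def probable_error_of_r(r, n):
    """Customary probable error of a correlation coefficient from n cases."""
    return 0.6745 * (1.0 - r * r) / math.sqrt(n)


N_CASES = 58  # number of papers scored for Form B
pe_low = probable_error_of_r(0.20, N_CASES)
pe_high = probable_error_of_r(0.48, N_CASES)
gap_in_pe_units = (0.48 - 0.20) / pe_low

print(f"P.E. of r = .20: {pe_low:.3f}")
print(f"P.E. of r = .48: {pe_high:.3f}")
print(f"difference of .28 expressed in P.E. units of the lower r: {gap_in_pe_units:.1f}")
```

The computed probable errors come close to those reported beneath Table I, and the gap between the extreme coefficients is a little over three probable errors. As the text insists, however, a sampling argument of this kind is beside the point when every coefficient comes from the same group of subjects.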

Fourth, there remains the practical problem: What is the reliability of Form B? The maker of the test arrived at the particular combination and order of questions which he used in Form B largely on the basis of chance. This arrangement happened to show a reliability of .32. He might certainly have hit upon any other combination in Table I, except possibly the second and third,¹ in which case the calculated reliability coefficient would have been quite different. And there is no reason to assume that the extremes of the possible range of coefficients are represented in Table I. Only eight combinations of questions were studied: there may have been some combination which would have yielded a coefficient as high as .60 and another, as low as .15.

The particular alterations responsible 
for the changes in the coefficients are re- 
ported in the first column of Table I. Space 
limitations prevent tracing out the rela- 
tions involved, a study which the reader may 
make for himself, if he so desires, since 
the descriptions are sufficiently detailed 
to be self-explanatory. 
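
The dependence of the half-test correlation on the composition of the halves can also be reproduced with artificial data. The following is a sketch only: the answers are simulated from a crude ability-plus-noise model that is entirely an assumption, not the data of this study, and every name in it is hypothetical. It scores the same 58 simulated papers on odds vs. evens and on eight "chance halves" and reports the spread of the resulting coefficients.

```python
import random


def pearson(xs, ys):
    """Product-moment correlation of two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate_answers(n_students, n_items, rng):
    """Synthetic right/wrong answers from an ability-plus-noise model (an assumption)."""
    difficulties = [rng.uniform(-1.0, 1.0) for _ in range(n_items)]
    answers = []
    for _ in range(n_students):
        ability = rng.gauss(0.0, 1.0)
        answers.append([1 if 0.5 * ability + rng.gauss(0.0, 1.0) > d else 0
                        for d in difficulties])
    return answers


def half_correlation(answers, first_half):
    """Correlate scores on the items in first_half with scores on the remaining items."""
    chosen = set(first_half)
    n_items = len(answers[0])
    first = [sum(row[i] for i in chosen) for row in answers]
    second = [sum(row[i] for i in range(n_items) if i not in chosen) for row in answers]
    return pearson(first, second)


rng = random.Random(1933)                 # fixed seed so the run is repeatable
answers = simulate_answers(58, 36, rng)   # 58 papers and 36 items, as for Form B

# Index 0 is question 1, so the even indices are the odd-numbered questions.
coefficients = [half_correlation(answers, range(0, 36, 2))]
# Eight chance halves, each a random draw of 18 of the 36 items.
coefficients += [half_correlation(answers, rng.sample(range(36), 18)) for _ in range(8)]

print([round(c, 2) for c in coefficients])
print("range:", round(min(coefficients), 2), "to", round(max(coefficients), 2))
```

No two splits need give the same coefficient even though the papers never change, which is the difficulty the preceding paragraphs describe. The size of the spread in any one run depends on the simulated data and should not be read as an estimate for a real test.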

Test I, Form C. Whatever the reliability of Form B, it was still too low to permit the type of measurement proposed. Consequently, Form C was prepared. While further juggling of test-items might have raised the reliability coefficient to some extent, such a procedure would have given little assurance that the test itself had been rendered more reliable. Accordingly a more fundamental type of change was selected, a type which had as its purpose the deliberate attempt to make the two halves of the test (odd and even) equally representative tests of the field covered by the whole test. Among the 36 questions in Form B there were 13 pairs which were based on the same paragraphs. For example, the pair, questions 3 and 16, were both answered in the fifth paragraph of the reading material. The 26 questions of these 13 pairs were distributed in such a way that one member of each pair was odd-numbered, and the other, even-numbered. There were ten questions which could not be so paired. Of these, five were specifically answered in the material, but the answers were in separate paragraphs. In Form C three of these questions were numbered odd, and two, even. The other five questions were more general in character and were answered by the sense of several paragraphs together. Two of these, in Form C, were numbered odd, and three, even. Finally, four new questions, representing inferences which might be falsely drawn regarding the point of view held by the author of the reading selection, were added to the 36 questions of Form B, and these four questions were equally distributed between the odd and the even halves of Form C.

Form C was given to 66 new students under 
three sets of conditions: first, immediately
after study, as in the case of Forms A
and B; second, within a period of eighteen 
hours, without opportunity to re-study and 
without notice of the re-test; and third, 
after ten days, in order to measure reten- 
tion. The three reliability coefficients 
resulting from correlating scores on the 
odds with scores on the evens were in ap- 
proximate agreement, .41, .36, and .41 re- 
spectively. 
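
The odd-even correlation described above, and the effect of re-dividing the same papers, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the response data are simulated (they are not the study's data), and the function names `pearson_r`, `half_scores`, and `split_half_r` are hypothetical.

```python
import random

def pearson_r(xs, ys):
    """Product-moment correlation of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def half_scores(responses, items):
    """Number right on the given item positions (0-based), one score per student."""
    return [sum(row[i] for i in items) for row in responses]

def split_half_r(responses, first_half):
    """Correlate scores on `first_half` with scores on all remaining items."""
    chosen = set(first_half)
    second_half = [i for i in range(len(responses[0])) if i not in chosen]
    return pearson_r(half_scores(responses, first_half),
                     half_scores(responses, second_half))

# Simulated (hypothetical) data: 66 students, 40 items scored 1 (right) or 0.
rng = random.Random(0)
abilities = [rng.uniform(0.5, 0.9) for _ in range(66)]
responses = [[1 if rng.random() < p else 0 for _ in range(40)] for p in abilities]

odd_items = list(range(0, 40, 2))      # questions 1, 3, 5, ... as 0-based positions
r_odd_even = split_half_r(responses, odd_items)

# Ten "chance halves": the same papers, divided at random each time.
chance_rs = [split_half_r(responses, rng.sample(range(40), 20)) for _ in range(10)]
print(round(r_odd_even, 2), round(min(chance_rs), 2), round(max(chance_rs), 2))
```

The point of the sketch is only that one set of papers yields a different coefficient for each division of the items; the size of the spread depends on the simulated data.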

As a check on the results secured in the 
case of Form B, the items in Form C were
artificially combined in ten other different
ways in order to study the effect of such
manipulation on the reliability coefficients.
The coefficients are reported in Table II,
the first three being those mentioned above.
The array of coefficients in the last column 








1. It is even conceivable that, starting out de novo, he might even have hit upon these. The reader should not be misled by the artificiality which of course did characterize the construction of test-halves in this study. The particular types of arrangement which were tried out thus artificially were employed, frankly, in the spirit of curiosity to see what would happen. But, as tests are prepared typically for classroom use, even the most extreme arrangements might be accidentally adopted.







will be seen to be similar to that for Form 
B in Table I, with a range from .23 to .47. 
With the exception of the descriptions in 
the last five rows of the table, the differ- 
ent methods of combining the items into 
halves will be clear from the statements in 
the left half of Table II. Mention has al- 
ready been made of the fact that the 66 stu- 
dents took Form C twice within eighteen 
hours, without warning of the proposed re- 
test and without opportunity to re-examine 
the reading material. A comparison of the 


TABLE II*
DATA ON TEST I, FORM C,--40-ITEM TRUE-FALSE TEST, 66 CASES

Nature of Halves                                          r**
 1. Odds vs. evens, original test ....................... .41
 2. Same, re-test within 18 hours ....................... .36
 3. Same, re-test after 10 days ......................... .41
 4. Chance halves ....................................... .32
 5. Same, another arrangement ........................... .32
 6. First half vs. second half .......................... .23
 7. Halves equalized for difficulty ..................... .33
 8. Same, another arrangement ........................... .32
 9. More difficult half vs. less difficult half ......... [illegible]
10. Halves equalized for reliability .................... .46
11. Same, another arrangement ........................... .37
12. More reliable half vs. less reliable half ........... .44
13. Halves equalized for difficulty and reliability ..... .47

*In this table and those which follow the exact distribution of questions, as for Form B in Table I, is omitted.

**The coefficients in this table were calculated separately by two individuals, the writer and Dr. J. H. Edds, formerly the writer's assistant. P.E.'s for the r's range from .065 (for .47) to .079 (for .23).


two test-papers of each student showed the 
questions on which he made changes in his 
answers on the second test, from "true" to 
"false," or vice versa. A count of these 
changes for the whole group of 66 students 
furnished a rough scale of the “reliability” 
of the 40 test-items. Thus, at one end of 
the scale, only two students changed their 
answers to question number one, three stu- 
dents changed answers to questions 13 and 
35, and so on. At the other end of the 
scale, 22 changes were made in the case of 
question four, and 24 in the case of question
40. It was these "reliability ratings"
of the various questions which were utilized 










in the last five combinations of questions 
reported in Table II. 
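
The count of changed answers that served as a rough "reliability rating" for each question can be sketched as follows. The answer strings are invented for illustration, and the names `change_counts` and `rank_items_by_stability` are hypothetical; the study does not say how ties between questions were handled, so the tie order here (by question number) is an assumption.

```python
def change_counts(first_papers, second_papers):
    """For each item, count the students whose answer differs between the two tests."""
    n_items = len(first_papers[0])
    return [sum(1 for a, b in zip(first_papers, second_papers) if a[i] != b[i])
            for i in range(n_items)]

def rank_items_by_stability(counts):
    """Question numbers (1-based), ordered from fewest to most changed answers."""
    return sorted(range(1, len(counts) + 1), key=lambda q: counts[q - 1])

# Invented answers ("T"/"F") of three students to four questions, test and re-test.
first = ["TTFF", "TFTF", "FFTT"]
second = ["TTFT", "TFFF", "FTTF"]

counts = change_counts(first, second)
print(counts, rank_items_by_stability(counts))
```

Questions at the head of the ranking are those on which answers changed least between the two administrations.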

Results with two other true-false tests 
(Tests II and III). Thus far the investiga- 
tion had dealt only with different forms of 
the same true-false test, covering the same 
body of subject-matter, devised by the same 
person, and given to a rather limited number
of students. On the chance that the extreme
variations found in the reliability
coefficients might have been due, wholly or 
in large part, to one or another of these 
factors, the investigation was extended to 
include other and longer true-false tests 
which were based on different bodies of sub- 
ject-matter, constructed by two other per- 
sons besides the writer, and given to a pop- 
ulation about double the size of those used 
in the case of Test I. 

Test II was also a true-false test. It 
consisted of 40 questions, was based on the 
material contained in several chapters of 
Gates' Psychology for Students of Education,
and was prepared by three instructors in a
course in Educational Psychology, of whom 
the writer was one. This test was given to 
131 students with the results reported in 
Table III. The range of coefficients for 
this test is from .33 to .57, or if the co- 
efficient .33 (row 8) be regarded as unfair- 
ly low, as well as that of .35 (row 11), the 
range is from .41 to .57. Though reduced in
extent, marked variability still persists in
spite of the changes made in test, subject-matter,
test-makers, and number of students.
And there is, of course, no reason to assume 
that the variability may not actually have 
been as great as for Test I; the number of 
combinations of questions here attempted may 
have been too small to reveal the true situ- 
ation. 

Test III, a third true-false test, con- 
tained 50 questions, was based on another 
section of Gates' text, was prepared by the
same three instructors as prepared Test II, 
and was given to 138 students. Twelve dif- 
ferent combinations of questions were tried 
out. The reliability coefficients, together 
with other data, are presented in Table IV. 
The range on this test is from .29 to .52. 
The lengthening of the test and the use of 














1. Arthur I. Gates, Psychology for Students of Education (Revised Edition, New York: Macmillan Company, 1930).







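TABLE III*
DATA ON TEST II,--40-ITEM TRUE-FALSE TEST, 131 CASES

                                                           X-half        Y-half
Nature of halves                                  r**    Mean  S.D.   Mean  S.D.
 1. Odds vs. evens .............................. .49    14.0  [?]    13.7  3.0
 2. Same, four pairs of items changed ........... .57    13.9  [?]    13.7  2.6
 3. Each third item, then each second:
    1, 4, 7, etc., and [?] ...................... [?]    [?]   [?]    [?]   [?]
 4. Chance halves ............................... [?]    [?]   [?]    [?]   [?]
 5. Same, another arrangement ................... [?]    [?]   [?]    [?]   [?]
 6. Same, another arrangement ................... [?]    [?]   [?]    [?]   [?]
 7. First half vs. second half .................. [?]    [?]   [?]    [?]   [?]
 8. True statements vs. false ................... .33    [?]   [?]    [?]   [?]
 9. Halves equalized for difficulty ............. [?]    [?]   [?]    [?]   [?]
10. Same, another arrangement ................... [?]    [?]   [?]    [?]   [?]
11. More difficult half vs. less difficult half . .35    [?]   [?]    [?]   [?]
12. Most difficult 5, skip next 10 most
    difficult, take next 10, etc. ............... .55    14.3  2.9    13.4  2.4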

























*In this table and the two following tables, the
means and standard deviations of the various
test-halves are reported. These measures were
omitted in Tables I and II because the data 
there presented were not considered of large 
importance. 

**All r's were calculated separately by two in- 
dividuals, the writer and Mr. Donald Agnew, a 
graduate student in the Department of Education 
at Duke University. P.E.'s for the r's range 
from .040 (for .57) to .053 (for .33). 


a different body of subject-matter did not 
eliminate the variability in reliability coefficients.

Results with a multiple-choice test (Test 
IV). All the tests studied up to this point 
were true-false tests. The variability of 
reliability coefficients might have been due 
to some peculiar property of this variety of 
test. To check this possibility a forty-eight
item multiple-choice test was prepared
by the same three instructors in Educational
Psychology who were responsible for Tests
II and III. Again the test content was taken
from Gates' Psychology for Students of
Education. Each question provided four possible
answers, from which the best one was
to be selected. The test was given to 134
students.
Table V contains the reliability coeffi- 
cients and other data obtained for this test. 
The range of coefficients is from .33 to .46 
for the twelve arrangements which were studied.
Variability is therefore seen to be
characteristic of this type of test as well
as of the true-false test. Whether the variability
is greater, less, or the same in
amount, the data here presented are of course
inadequate to determine.


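TABLE IV
DATA ON TEST III,--50-ITEM TRUE-FALSE TEST, 138 CASES

                                                           X-half        Y-half
Nature of halves                                  r*     Mean  S.D.   Mean  S.D.
 1. Odds vs. evens .............................. [?]    17.4  3.0    [?]   [?]
 2. Same, four pairs of items changed ........... [?]    [?]   [?]    [?]   [?]
 3. Chance halves ............................... [?]    [?]   [?]    [?]   [?]
 4. Same, another arrangement ................... [?]    [?]   [?]    [?]   [?]
 5. Same, another arrangement ................... [?]    [?]   [?]    [?]   [?]
 6. First half vs. second ....................... [?]    [?]   [?]    [?]   [?]
 7. True statements vs. false statements ........ [?]    [?]   [?]    [?]   [?]
 8. Halves equalized for difficulty ............. [?]    [?]   [?]    [?]   [?]
 9. Same, another arrangement ................... [?]    [?]   [?]    [?]   [?]
10. Same, another arrangement ................... [?]    [?]   [?]    [?]   [?]
11. More difficult half vs. less difficult half . [?]    [?]   [?]    [?]   [?]
12. Most difficult 5, skip next 10 in
    difficulty, take next 10, [?] ............... [?]    [?]   [?]    [?]   [?]

*See second footnote to Table III. P.E.'s for r in this table range from .042 (for .52) to .053 (for .29).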



















From the descriptions provided in the table
the reader will be able to discover the
methods employed in making up the various
test-halves, except in the case of the sixth,
seventh, and twelfth arrangements. In the
case of the sixth and seventh, the test- 
halves were constructed with reference to 
the number of the correct answer among the 
four possible answers provided. In the case 
of the twelfth, a deliberate attempt was 
made to lower the reliability coefficient. 
The method was as follows: first, the three 
arrangements which had yielded the lowest 
coefficients were selected (rows 2, 3, and 
10). Second, the items contained in one of 
the halves for each arrangement were listed 
side by side by number. Third, those items 
which were found in all three halves were 
first chosen for one-half of the new ar- 
rangement; then those which appeared in two 
of the halves were selected, and finally, a 
few which appeared in only one-half, in order
to make up the requisite twenty-four items.
The effort to lower the coefficient
by these means was not successful.
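
The three steps used to build the twelfth arrangement amount to ranking items by the number of low-coefficient halves in which they appear. A minimal sketch follows, using a made-up 12-item test in place of the 48-item one; the function name `overlap_half` is hypothetical, and breaking ties by item number is an assumption, since the text says only that "a few" single-occurrence items were taken.

```python
from collections import Counter

def overlap_half(halves, size):
    """Pick `size` items, preferring those found in the most of the given halves.

    Ties are broken by item number (an assumption of this sketch).
    """
    counts = Counter(item for half in halves for item in set(half))
    ordered = sorted(counts, key=lambda item: (-counts[item], item))
    return sorted(ordered[:size])

# Made-up halves from three arrangements of a 12-item test.
half_a = [1, 2, 3, 4, 5, 6]
half_b = [1, 2, 3, 7, 8, 9]
half_c = [1, 2, 4, 7, 10, 11]

new_half = overlap_half([half_a, half_b, half_c], 6)
other_half = [i for i in range(1, 13) if i not in new_half]
print(new_half, other_half)
```

Here items 1 and 2 occur in all three halves, items 3, 4, and 7 in two, and one single-occurrence item completes the new half.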


PREVIOUS STUDY OF THE PROBLEM 


So far as could be discovered by a care- 
ful examination of the periodical litera- 
ture the particular problem which was the 
object of this study has never before been 
investigated. That is to say, no one has 
before reported data on those difficulties 
involved in determining true reliability by 
the split-halves technique which have been 
studied in this investigation. 

Various studies have been made, which 
have had as their purpose the investigation 
of various factors which influence test re- 
liability and, more especially, of the val- 
ue of the Spearman-Brown prophecy formula 
for this purpose. Among these studies there 
are two types which bear indirectly upon the 
problem of this paper. 

The first of these types has to do with 
the relation between test-length and test- 










TABLE V*
DATA ON TEST IV,--48-ITEM MULTIPLE-CHOICE TEST, 134 CASES

                                                           X-half        Y-half
Nature of test-halves                             r      Mean  S.D.   Mean  S.D.
 1. Odds vs. evens .............................. .44    14.3  3.5    14.6  3.0
 2. Chance halves ............................... .33    14.5  3.1    14.5  3.1
 3. Same, another arrangement ................... .27    14.3  3.3    14.4  3.4
 4. Same, another arrangement ................... .43    14.1  3.2    14.7  3.2
 5. First half vs. second half .................. .46    14.1  3.2    14.8  3.2
 6. Questions whose answers were choices 1 or 2
    vs. questions whose answers were choices
    3 or 4 ...................................... .43    14.6  3.3    14.3  2.9
 7. Questions whose answers were choices 1 or 4
    vs. questions whose answers were choices
    2 or 3 ...................................... .43    14.5  3.3    14.4  3.1
 8. Halves equalized for difficulty ............. .46    14.6  2.9    14.3  3.3
 9. Same, another arrangement ................... .46    14.4  3.3    14.5  3.2
10. Same, another arrangement ................... .36    14.4  3.3    14.5  3.2
11. More difficult half vs. less difficult half . .35    11.5  3.2    17.3  3.2
12. Deliberate effort to lower coefficient ...... .36    14.5  3.0    14.3  3.4

*See first footnote to Table III. P.E.'s for r in this table range from .045 (for .46) to .052 (for .33).


reliability. Four which are of special value
for [illegible] measurement are those by
Holzinger,1 Holzinger and Clayton,2 Ruch et
al.,3 and Lanier.4

Holzinger dealt with ten sub-tests of two 
forms of the Terman Group Test of Mental 





1. Karl J. Holzinger, "Note on the Use of Spearman's Prophecy Formula for Reliability," Journal of Educational Psychology, XIV (May, 1923), 302-05.
2. Karl J. Holzinger and Blythe Clayton, "Further Experiments in the Application of Spearman's Prophecy Formula," Journal of Educational Psychology, XVI (May, 1925), 289-99.
3. Giles M. Ruch, Luton Ackerson, and Jesse D. Jackson, "An Empirical Study of the Spearman-Brown Formula as Applied to Educational Test Material," Journal of Educational Psychology, XVII (May, 1926), 309-15.
4. Lyle H. Lanier, "Prediction of the Reliability of Mental Tests and Tests of Special Abilities," Journal of Experimental Psychology, X (February, 1927), 69-115.







Ability. A series of coefficients was ob- 
tained by correlating scores on the first 
unit of one form with scores on the first 
unit of the second form, then by correlating 
the combined scores for the first two units 
on the two forms, then the combined scores 
on the first three units, and so on. This
series of coefficients differed considerably
from the theoretical coefficients derived by
the use of the Spearman-Brown formula.

Holzinger and Clayton in their article 
really report two separate investigations. 
The first of these made use of scores on ten 
units of two forms of the Otis Self-Administering
Test of Mental Ability, Advanced Examination.
The conclusion of this study was:
"Agreement with the Spearman-Brown Formula
is far from close with material which upon 
casual inspection would appear to be fairly 
well graduated and homogeneous. Extreme 
caution in the application of the formula to 
such test material would therefore seem to 
be necessary." Their second investigation
was based on scores made on seven pairs of 
spelling tests, the words of which were care- 
fully selected from the Buckingham revision 
of the Ayres scale, in such a way that these 
words were of equal difficulty. With this 
material which was "fairly well graduated 
and homogeneous" the actual coefficients and 
the theoretical coefficients approximated 
each other very well. 

Ruch, et al., in their study likewise 
made use of spelling words, twenty cycles 
of spelling tests, each test containing 25 
words, chosen from the Ashbaugh Iowa Spelling
Scale in a manner to insure equal difficulty
of the tests. Here again, when the
assumptions involved in the Spearman-Brown 
formula had been met as closely as possible, 
there was agreement between the expected and 
the obtained coefficients. 

Lanier1 made use of twelve different kinds
of test material. The predicted coefficients 
were quite different from the obtained coef- 
ficients in the case of ten of the twelve 
tests. 




It is not the purpose in citing this se- 
ries of studies to renew the debate on the 
relation between test-length and test-reliability.
Rather, the purpose is to call attention
to two underlying factors in these
studies which are mentioned by certain of
the writers. Holzinger has emphasized the
importance of the nature of the test materi- 
al in any consideration involving reliabil- 
ity. Ruch has stressed the mathematical as- 
sumptions inherent in the Spearman-Brown 
formula. The studies reviewed above show 
clearly that the test items must be "homoge- 
neous" (to use Holzinger's term) if reliabil- 
ity is correctly to be measured by correlat- 
ing test-halves. Ruch has stated that the 
assumptions in the prophecy formula require 
that "the standard deviations of the unitary 
tests must be equal and that the intercorrelations
of all pairs of tests must be equal."

The other type of study which is relevant 
to the problem of this paper is that which 
has shown the variation in reliability coef- 
ficients obtained by correlating different 
sections within a given test. Of the studies
of this variety two may here be mentioned,
those by Wood2 and by Jones.3 It should
be noted that the findings to be cited were
not presented in the original reports for
the purpose for which they are here used.

Wood experimented with a number of types 
of subject-matter. His report includes data 
based on the responses of 100 college stu- 
dents to a 100-item Recognition Test in 
French Vocabulary, a 200-item True-False 
Test on the comprehension of French sentenc- 
es, a 140-item True-False Test on Equity 
(Law), a 180-item True-False Test on Legal 
Pleading and Practice, and so on. Only a 
small part of his study and of his data is 
here reviewed. The 200-item French Transla- 
tion Test was broken up arbitrarily into 
tests of varying length, containing 10, 20,
30, 40, and 50 items each. Correlations
were then computed between the different
pairs of sub-tests of the same length. Thus,
eight coefficients were calculated for pairs





1. The twelve types were: one Otis Self-Administering Test, Advanced Examination; six of the Seashore Musical Tests; and five tests of "mechanical abilities."
2. Ben D. Wood, "Studies of Achievement Tests, Part III," Journal of Educational Psychology, XVII (April, 1926), 263-269.
3. Harold E. Jones, "A Comparison of Objective Examination Methods," Journal of Educational Method, VIII (February, 1929), 275-276.










of 10-item tests. The highest of these (based
on scores of "rights," as were all the coefficients
here cited) was .564; the lowest was
.185. The variation for ten coefficients
computed for pairs of 20-item tests was .698
to .372; for three pairs of 30-item tests,
.696 to .665; for four pairs of 40-item
tests, .805 to .621; for four pairs of 50-item
tests, .866 to .750. Similar results
were secured for the other tests. For example,
in the case of the test on Pleading
and Practice the range for pairs of 10-item
tests was .463 to .185; for 20-item tests,
.626 to .389; for 30-item tests, .768 to [illegible];
for 40-item tests, .776 to .624; for 90-item
tests, .866 to .764.

Jones reports results secured when he 
broke up a 130-item true-false test on the 
material of educational psychology into five 
sub-tests of 26 items each, "each part having
equal representation from all sections
of the test." His data are based on 142 pa- 
pers. The mean of the ten coefficients ob- 
tained by correlating scores on pairs of 
sub-tests averaged .355, with a range in 
P.E.'s of .03 to .05. The original coefficients,
not reported in the article, must
have been between approximately .34 and .68.
(These r's have been secured by substituting
.03 and .05 in the formula $P.E._r = .6745\,(1 - r^2)/\sqrt{N}$,
and solving for r.)
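
The substitution described in the parenthesis can be checked with a short sketch. It assumes the conventional probable-error formula for a correlation coefficient, P.E. = .6745 (1 - r²)/√N, which is what the parenthetical appears to refer to; the function names are hypothetical.

```python
import math

def probable_error(r, n):
    """Probable error of a correlation coefficient r based on n cases."""
    return 0.6745 * (1 - r * r) / math.sqrt(n)

def r_from_probable_error(pe, n):
    """Invert the formula: the non-negative r whose probable error is pe."""
    return math.sqrt(1 - pe * math.sqrt(n) / 0.6745)

# Jones: 142 papers, P.E.'s ranging from .03 to .05.
high = r_from_probable_error(0.03, 142)
low = r_from_probable_error(0.05, 142)
print(round(low, 3), round(high, 3))

# The same formula applied forward to the extreme coefficients of Table II (66 cases).
print(round(probable_error(0.47, 66), 3), round(probable_error(0.23, 66), 3))
```

Under this formula the two solutions come out near .34 and a little above .68, in line with the approximate limits given in the text, and the forward calculation agrees with the probable errors quoted in the footnote to Table II.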

These two investigations (others similar 
in kind might be mentioned) have been refer- 
red to in order to emphasize in a different 
way the dependence of the size of the relia- 
bility coefficient of a given test upon var- 
ious factors. In the case of the Wood and 
the Jones studies reliability coefficients 
for the tests as wholes would have varied 
materially in terms of the particular pair 
of sub-tests which were correlated to fur- 
nish the basis for the prediction. Thus, in 
the case of the Wood French Translation Test, 
if the highest coefficient obtained by cor- 
relating pairs of 40-item tests had been 
used for prediction, the reliability coefficient
would have been .95; if the lowest
had been used, it would have been .89.1 In 
the Wood Pleading and Practice Test the 
predicted coefficients from even the comparatively
long 90-item tests would have
been no closer in agreement than .93 and
.87. And Jones might have obtained reliability
coefficients for his 130-item test
as widely different as .72 and .92, depend- 
ing upon the pair of 26-item tests used as 
the basis of prediction.2
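
The predicted coefficients quoted in this paragraph follow from the general Spearman-Brown formula, k·r / (1 + (k - 1)·r), where k is the number of times the whole test is longer than the parts correlated. A minimal sketch (the function name is hypothetical, and the values of k are inferred from the test lengths given in the text):

```python
def spearman_brown(r, k):
    """Predicted reliability of a test k times as long as the parts correlated."""
    return k * r / (1 + (k - 1) * r)

# Wood, French test: 200 items predicted from pairs of 40-item sub-tests (k = 5).
french = [round(spearman_brown(r, 5), 2) for r in (0.805, 0.621)]

# Wood, Pleading and Practice: 180 items from pairs of 90-item sub-tests (k = 2).
pleading = [round(spearman_brown(r, 2), 2) for r in (0.866, 0.764)]

# Jones: 130 items from pairs of 26-item sub-tests (k = 5).
jones = [round(spearman_brown(r, 5), 2) for r in (0.34, 0.68)]

print(french, pleading, jones)
```

The Wood figures reproduce the .95 and .89, and the .93 and .87, cited above. For Jones the lower limit reproduces .72; the upper limit comes out at .91 when the rounded value .68 is stepped up, so the .92 in the text presumably rests on the unrounded coefficient.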


CONCLUDING STATEMENT 


In publishing the results of this study 
there is no notion of making a contribution 
to the statistical theory of educational 
measurement. No attack is contemplated up- 
on the Spearman-Brown Prophecy Formula as 
such. The present study does not, and by 
reason of its nature could not reveal weak- 
nesses in the mathematics underlying the 
formula. On the contrary, the study as- 
sumes the validity of the formula and is 
concerned only with an attempt to satisfy 
the conditions which justify its use. These 
conditions have to do, primarily, with ob- 
taining a coefficient of correlation be- 
tween scores on the test halves. 

Evidence has been adduced which would 
seem to demand considerable caution in the 
determination of the reliability of short 
objective tests through the correlation of 
test-halves. It has been shown that the re- 
liability coefficient so obtained varies 
markedly with the nature of the distribution 
of test-items between the halves of the test. 
The range of coefficients secured in this 
study by the manipulation of test-items in 
such a way as artificially to construct dif- 
ferent test-halves should be convincing tes- 
timony in this connection. The subsequent 
application of the Spearman-Brown formula 
can be relied upon in no way to correct for 
errors introduced antecedent to its applica- 
tion. There is no magic in the formula, a 





1. When the inter-correlations between test-parts vary widely it is the practice to use the average of the coefficients as the basis of prediction. An example of this practice, together with an excellent discussion of the need for the same, is to be found in Ruch, op. cit.

2. While this variation in reliability coefficients agrees with those reported for the present study, this agreement in variation should not conceal the difference in the experimental approaches in the two types of study. In the Wood and Jones investigations all sub-tests of the same length contained different items: the constituent elements appeared in only one of the series of sub-tests of the same length. In the present study the different test-halves were constructed by variously arranging the same items: the constituent elements appeared again and again in different sub-tests, though of course a given item appeared in only one of a given pair of sub-tests which were correlated.




property which seems to be required by the 
treatment of the topic in many measurement 
texts. On the contrary, the formula accepts 
at face value the correlation coefficient 
for the test-halves however inaccurate and 
invalid it may be. From this point on, the 
process which eventually yields a reliabil- 
ity coefficient for the whole test is a mat- 
ter of simple mathematical calculation. 

The crux of the matter lies therefore in 
the fact that the correlation of scores on 
test-halves measures only the degree to 
which scores on the two parts are consist- 
ent. Any manipulation of test-items, or of 
other conditions, which improves this con- 
sistency therefore becomes a method of in- 
creasing the reliability coefficient whether 
it increases the reliability of the test or 
not. Under such circumstances reliability 
may readily cease to be a basic characteris- 
tic of the measuring instrument and may be- 
come, either through ignorance or through 
design, the object of clever maneuvering and 
juggling. 

The coefficient obtained by correlating 
test-halves and then by stepping-up the 
measure through the application of the Spear- 
man-Brown formula is not rightly the relia- 
bility coefficient for the given test unless 
the assumptions underlying the formula are 
satisfied. These assumptions are summarized 
with especial clearness by Professor Walter 
S. Monroe of the University of Illinois, in a
letter to the writer: 


"The formula (as applied to test-halves) 
hypothesizes four measures. Two of these 
measures are yielded by the halves of 
the test administered. Call these $x_1$
and $y_1$. Each of these measures has a
hypothetical mate which may be designated
by $x_2$ and $y_2$. The derivation of the
Spearman-Brown formula assumes the
standard deviations of the four sets of
measures are equal and also that the
various intercorrelations are equal.
Thus two tests are hypothesized, one
consisting of one-half of the original
test plus an equivalent test; the other
test is made up of the other half plus
an equivalent test. The measures yielded
by these two tests may be represented
by $x_1 + x_2$ and $y_1 + y_2$.

"Assuming these conditions satisfied,
the estimated coefficient of correlation,
which is $r_{(x_1+x_2)(y_1+y_2)}$, is
the coefficient of correlation between
the measures yielded by the two hypothesized
tests when administered under
identical conditions. Whether this coefficient
should be recognized as the
coefficient of reliability of the original
test, depends upon the equivalence
of the hypothesized tests and
the meaning to be associated with the
coefficient of reliability in this
case. These hypothesized tests will
not be equivalent unless the halves of
the original tests are equivalent."
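
The two assumptions named in the letter are enough to recover the familiar stepped-up coefficient. As a sketch, write $\sigma$ for the common standard deviation of the four measures and $r$ for their common intercorrelation (both symbols are introduced here for illustration and are not in the letter):

```latex
\begin{align*}
\operatorname{Var}(x_1 + x_2) &= \sigma^2 + \sigma^2 + 2r\sigma^2 = 2\sigma^2(1 + r)
  = \operatorname{Var}(y_1 + y_2), \\
\operatorname{Cov}(x_1 + x_2,\; y_1 + y_2) &= 4r\sigma^2, \\
r_{(x_1+x_2)(y_1+y_2)} &= \frac{4r\sigma^2}{2\sigma^2(1 + r)} = \frac{2r}{1 + r}.
\end{align*}
```

If the standard deviations or the intercorrelations are unequal, the cancellation in the last line fails, which is why the equivalence of the halves matters to the prediction.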


The difficulty of meeting the condition 
of statistical equivalence in short objec- 
tive tests is exemplified in this study. 
Reference to certain data in Tables III, IV, 
and V will serve to make this clear. In these 
tables several instances may be found in 
which pairs of test-halves, equal as to M 
and S.D., yielded quite different correlation 
coefficients. In this study the relative 
difficulty of test-items was not known at 
the outset. That much the same difficulty 
obtains in establishing statistical equiva- 
lence even when the difficulty of test-items 
is known in advance, is well demonstrated in
Ruch's study.1

If one approaches the problem from the
standpoint of test-content instead of from
the standpoint of statistics, and if, in so
doing, he attempts to make his test-elements
"homogeneous," he encounters other problems.
Just what is this "homogeneity"? Does this
characteristic refer to items of equal difficulty,
or to test-halves of equal difficulty,
or to equivalent representation of
the field covered by the test? Some of these
conditions were met in the various arrangements
of test-items in this study, and yet
the data in Tables III, IV, and V, as well
as those referred to in Jones' study,2 show
that the correlation coefficients varied.

And there remains the further practical 
problem: If homogeneity implies items of 
equal difficulty or test-halves of equal dif- 
ficulty, such as prevailed in the excellent 
studies reviewed in which spelling words were 
used, it must be insisted that such conditions





1. G. M. Ruch, et al., op. cit.
2. Harold E. Jones, op. cit.














are not at all characteristic of the situa- 
tion in which one typically sets about the 
preparation of an educational test. Custom- 
arily one does not start with fore-knowledge 
of the comparative difficulty of the items 
to be included. Furthermore, it is only in 
rare circumstances, such as appear seldom 
even in the subject of spelling, that one 
has the chance to use items of equal diffi- 
culty. Rather, one starts with items which, 
it is supposed, cover the field of the test 
in an adequate fashion without knowing their 
relative difficulty and certainly without 
having (and most generally without wanting) 
test-items equal in this respect. In a word, 
the conditions of the present study are sim- 
ilar to those under which short educational 
tests are usually devised,1 and the conflicting
results which were obtained may be expected
to appear under the corresponding
conditions of the classroom.

The foregoing general discussion suggests 
certain applications. In the first place, 
it suggests that if a text on educational 
measurement treats the split-halves technique 
at all, it should go further than a mere de- 
scription of the method. Such an abbrevi- 
ated and uncritical account encourages mis- 
interpretations and misuse. It provides the 
reader with an implement without assuring 
intelligence in its use. 

The points made in the foregoing discus- 
sion also raise a question concerning the 
validity of reliability coefficients for the 
various short standard tests. Heeding the 
advice of the more critical writers on edu- 
cational measurement, the makers of these 
tests report not only the size of the reli- 
ability coefficient, but also the method of 
its computation, and some of them also report
the S.D.'s obtained. Still, one may
reasonably wonder about such coefficients 
for short tests when these coefficients have 
been computed by the split-halves method.
How fortunate was the test-maker in the particular
arrangement of his test-items as between
the halves?

And there is a third application of this 
discussion, perhaps the most important one.
The periodical literature is filled with the 
reports of experimental investigations in 
which a single measuring instrument has been 
used as a means of evaluating the results secured.
Frequently these instruments are
short, and commonly they contain the most 
surprising types of elements. In such cir- 
cumstances the split-halves technique is re- 
sorted to as the method of establishing the 
reliability of the instrument. It is indeed 
rare that any evidence is furnished to sup- 
port the belief that the reliability coeffi- 
cient reported has been calculated in a man- 
ner which honors the assumptions basic to 
the formula. And there is good reason for 
this absence of evidence: If the worker has
received his training in measurement from 
the typical text in that field and his train- 
ing in statistics from the typical text in 
that field, he is hardly prepared to exercise 
caution. His one safeguard, outside of the 
supplementary information which he may ac- 
quire from instructor or other texts, is the 
experimental literature on the topic. In view 
of the minor importance he is apt to attach 
to the mathematics of his formulas, it is 
hardly likely he will be fortified from this
source.

The general conclusion, then, is that 
which has been printed at the ends of numer- 
ous studies concerned, like this one, with 





1. As a matter of fact, it might properly be argued that the conditions under which the tests studied in this investigation were prepared were unusually favorable to the estimating of reliability from test-halves. The three individuals who constructed the more important tests were certainly more familiar with the various techniques involved in test construction than are typical classroom teachers.

Perhaps at this point note should also be made of the fact that the tests studied in this investigation were all short. It is quite probable that had tests of 100 or more items been employed the variability of the reliability coefficients would have been much less. The doubling or tripling of the number of items in the test-halves might have automatically removed many of the sources of trouble here found in the short tests. Data might well be secured on this hypothesis.

2. It may be thought that such a treatment is more properly the function of the writer on educational statistics. In reply it should be noted that the type of treatment contemplated need not be extended beyond the giving of a few necessary cautions. Perhaps more important is the fact that the writers on statistics are equally remiss in their treatment of the topic, and are likewise responsible, or perhaps jointly responsible, for failure to inform the worker in education. A check on six standard references in statistics shows that (1) one text hardly discusses test reliability at all, not even mentioning the Spearman-Brown formula; (2) another shows the use of the formula only for the purpose of estimating the degree to which a given test must be lengthened to secure a certain degree of reliability; (3) two others show the use of the formula for the purpose of estimating reliability from test-halves, one of them stating

March, 1933 W. A. Brownell 215 


research techniques: There is grave danger in applying procedures without an intelligent grasp on their legitimate uses and on their limitations. As applied to the particular problem of this paper, the general conclusion may be elaborated: To calculate the coefficient of correlation between test-halves is to do one thing; to predict from this coefficient the reliability of the entire test by invoking the Spearman-Brown formula is quite another matter. To do the first is to do something which may be of little practical value but which is quite unobjectionable theoretically; to do the latter may be for the purpose of securing a constant of considerable practical value, but it may be quite open to objections of theoretical (that is, mathematical) sort. The constant obtained by the first computation simply measures the degree to which scores on the test-halves are consistent; the constant obtained by the second computation measures the reliability of the whole test only when the conditions of the formula are satisfied.
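
The two computations distinguished in this conclusion can be set side by side in a short sketch. The first function gives the correlation between test-halves; the second applies the Spearman-Brown formula in its usual form, r_nn = n r / (1 + (n - 1) r), with n = 2 for a whole test twice the length of a half. The third is the inverse use mentioned in the footnote on statistics texts, namely estimating how much a test must be lengthened. The function names and the sample scores are illustrative assumptions, not data from this study, and the stepped-up value estimates whole-test reliability only if the halves meet the conditions of the formula.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half, n=2.0):
    """Predicted reliability of a test n times as long as the part correlated."""
    return n * r_half / (1.0 + (n - 1.0) * r_half)


def length_factor(r_current, r_target):
    """Factor by which a test must be lengthened to reach r_target."""
    return r_target * (1.0 - r_current) / (r_current * (1.0 - r_target))


# Hypothetical scores of six pupils on the odd and even items of one test.
odd_half = [12, 15, 9, 20, 17, 11]
even_half = [10, 16, 8, 18, 19, 12]

r_halves = pearson_r(odd_half, even_half)   # consistency of the halves
r_whole = spearman_brown(r_halves)          # valid only under the formula's conditions
```

The first figure describes the data in hand; the second is a prediction whose worth rests entirely on whether the halves are in fact comparable.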











(Footnote continued) 
that the formula has been "attacked" and that it cannot be used in all cases as its usefulness "depends on the nature of the instrument," while the other advises caution on the ground that the formula requires that the test-halves be "approximately equal in difficulty and content"; neither, however, discloses the mathematical reasons; and (4) the remaining two show the derivation of the formula. It should be stated, however, that these two texts are the most technical and probably among the least used of the six.














216 JOURNAL OF EXPERIMENTAL EDUCATION Volume I, No. 3


NEED FOR STANDARDIZATION OF SYMBOLS AND FORMULAE IN EDUCATIONAL STATISTICS
Paul V. West
School of Education, New York University


This report is concerned with the analysis of current textbooks and treatments of educational statistics with a view to discovering the extent to which they agree or disagree in their systems of notation, symbols, and formulae for the expression of the more common statistical concepts. The problem had its inception in extended experience with students of statistics who have reported much difficulty in reading various texts because of the confusion which they encounter in the use of notation.

It is doubtless true that the trained mathematician or statistician will be but slightly delayed in the reading and study of such varied treatments since he has mastered the basic concepts and has become skillful in manipulating symbols, however irritated he may be by what he considers the misuse of terms. This is not the case, however, with the ordinary beginner in the study of educational statistics. In fact, the effectiveness with which his learning progresses doubtless depends greatly upon the extent to which he is kept free from confusion.

Educational statistics has derived its procedures from a variety of sources which are often greatly at variance as to their systems of notation. Some writers in this field have been trained in mathematics and have naturally been influenced by this background. Others may have adopted the specialized procedures of the natural or engineering sciences for a similar reason. The difference between the symbolism of the natural sciences and that of the social sciences is illustrated by the fact that in the former r may stand for probable error, p for probable error of a curve that is tabulated, and » for mean error. Some writers in educational statistics have been influenced, not only by the more common notations used in the field of business, but also by the procedures used by specific authors. It appears that the notations used have been more largely determined by the English school, represented by Yule, Pearson and Spearman, than by any other single factor, although there has been no slavish imitation.

It appears further that some writers have 
deliberately modified the systems of nota- 
tion with which they are acquainted, or have 
introduced novelties of their own with a view 
to simplifying the symbols and procedures for 
the beginner and lay reader. Several authors 
have sought simplicity by using verbal terms 
in full, or in slightly abbreviated form, and
by avoiding the use of symbols to any marked 
degree. This is in marked contrast to others 
who use elaborate systems of notation. Al- 
though the former procedure may be justified 
to some extent as a primary method of in- 
struction for the beginner, it is evident 
that the learner must, sooner or later,
master an adequate notation system if he is to
become efficient in solving problems and re- 
cording his solutions, and if he is to de- 
velop any ability in reading modern educa- 
tional literature. 

The books investigated include 16 treat- 
ments. Some of these are texts which deal 
with the elementary phases of the subject, 
while others include more advanced material.
Some are work-books or manuals. No attempt 
was made to examine the cursory treatments 
of statistics given in textbooks dealing 
mainly with educational measurements, educa- 
tional psychology, and research methods. 
Neither was any consideration given to the 
presence or absence of treatment of specific 
topics, or error in the treatment of these 
topics. Care was taken not to include apparent
typographical errors among the variant forms.
The list of books follows. All are widely used
in education, either as texts or references.








1. Address delivered before American Educational Research Association in February, 1932, at Washington, D.C.





H. E. Garrett .......................... Statistics in Psychology and Education
H. A. Greene ........................... Work Book in Educational Measurements
C. A. Gregory and O. W. Renfrow ........ Statistical Method in Education and Psychology
K. J. Holzinger ........................ Statistical Methods for Students in Education
T. L. Kelley ........................... Statistical Method
E. F. Lindquist and G. D. Stoddard ..... Study Manual in Elementary Statistics
Marion E. Macdonald .................... Practical Statistics for Teachers
R. L. Morton ........................... Laboratory Exercises in Educational Statistics
C. W. Odell ............................ Educational Statistics
A. S. Otis ............................. Statistical Method in Educational Measurement
H. O. Rugg ............................. A Primer of Graphics and Statistics
H. O. Rugg ............................. Statistical Methods Applied to Education
E. W. Tiegs and C. C. Crawford ......... Statistics for Teachers
L. L. Thurstone ........................ The Fundamentals of Statistics
F. L. Whitney .......................... Statistics for Beginners in Education
J. H. Williams ......................... Elementary Statistics


The problem of consistency naturally divides into two parts: first, the consistency of an author with himself, and second, the intra-consistency of the various texts. Consistency also may be of two types: first, the degree to which authors use a system of notation which provides one symbol and only one for any specific term, and second, the extent to which any specific notation is used for only one term. Instances of two corresponding types of inconsistency are given in the cases following:


TYPE ONE 


Two or more symbols for the same term 


Specific number of cases .......


N, n 


um seer eee eee eer ee eee eee eeeeeee Z> 


Mean average ...................
Deviation from mean (interval) . 


M, X, T.M., A 
a', x, x-¥ 


Unit-Correction about trial mean c, C 


Median .........................
Assumed Mean ...................
Number of measures in column ...


Md., ee Md. 
Est. Mean; A. M. 
Ny, Mx, Lx 


TYPE TWO 


The same symbol used to designate two or more 


terms 


M ..e. measure; midpoint of interval, slope of 


line. 


S .... sum; uncorrected standard deviation, sum 
of quotients of squares of cell entries 
divided by independence values. 





Type Two (Continued) 


sum; numerator of fraction for correlation 
ratio. 

number in specific series; number in inde- 
terminate series. 

unit correction of assumed mean; coeffi- 
cient of contingency. 


Textbook B, a later and widely used text, uses
five different symbols for the mean average
(A. 8, 3, AV (obt.)? Avy); two (D and x') for
the interval deviation from the mean, and two
(S and σ(est.)) for standard error of estimate.


In this text, also, M stands for both measure 
and mean average, while D is used to indicate 
deviation from the mean in terms of units and 
intervals, difference between means, and dif- 
ference in ranks. 
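
The two types of inconsistency just illustrated can be checked mechanically once the usages in a text are tabulated. The sketch below is a minimal illustration under the assumption that each usage is recorded as a (symbol, term) pair; the function name and the data layout are hypothetical, and the sample pairs are taken from the legible entries under Type One and Type Two above.

```python
from collections import defaultdict


def find_inconsistencies(usages):
    """Return (type_one, type_two) for a list of (symbol, term) pairs.

    type_one: terms written with two or more symbols.
    type_two: symbols standing for two or more terms.
    """
    symbols_for_term = defaultdict(set)
    terms_for_symbol = defaultdict(set)
    for symbol, term in usages:
        symbols_for_term[term].add(symbol)
        terms_for_symbol[symbol].add(term)
    type_one = {t: sorted(s) for t, s in symbols_for_term.items() if len(s) > 1}
    type_two = {s: sorted(t) for s, t in terms_for_symbol.items() if len(t) > 1}
    return type_one, type_two


# Usages drawn from the examples listed above.
usages = [
    ("N", "number of cases"),
    ("n", "number of cases"),
    ("c", "unit correction about trial mean"),
    ("C", "unit correction about trial mean"),
    ("M", "measure"),
    ("M", "midpoint of interval"),
    ("M", "slope of line"),
]

type_one, type_two = find_inconsistencies(usages)
```

Applied to a single text, the tabulation measures an author's consistency with himself; applied to the pooled usages of several texts, it yields counts of the kind reported in List I and List II.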


Numerous cases of such inconsistencies 
might be cited, but most of them are of minor 
significance in the average text. It would,
indeed, be bothersome to the beginner if the 
text he was using presented a great variety 
of symbols to express the same meaning, or 
if it employed the same symbol with a varie- 
ty of meanings. This does not occur to any 
great degree, however, and the variations 
that do exist are so presented in contextual 
development as to cause a minimum of confusion.
The chief concern of this investigation is
with the variations in the symbolism in
different textbooks.

In List I will be found the varied meanings given in these texts to each of the capital and small letters of the alphabet, with the frequency of occurrence of each meaning placed in parentheses at the end of each term. It may be noted that no combinations of letters such as A.M., S.D., P.E., U.L., etc., are here presented, nor is the use of letter symbols as subscripts here indicated, although these are frequent occurrences.


LIST I 


Letter Symbols and Their Varied Meanings 
With Frequency of Occurrence in Sixteen Texts 
in Educational Statistics 


Central Tendency (1) Assumed Mean (1) 
Mean (3) 

Weight in multiple regression equation (1) 
Correction in units (3) 

Contingency coefficient (5) 

Constant in regression equation (3) 
Criterion (2) 

Unit deviation from Mean (2) 

Interval deviation from Mean (1) 
Difference between Means (2) 

Difference in Ranks (10) 

10-90 percentile Range (1) 










List I (Continued) 


Interval deviation from Assumed Mean (1) 
Probable Error (1) 

Frequency (2) 

Frequency in grouped data (1) 

Frequency up to interval (1) 

Geometric Mean (1 

Gains in Ranks (4 

Harmonic Mean (4) 

Heads of Binomial Curve (3) 

Units in Interval (3) 


aaew a Q 
' 


Constant in regression equation (1) 
Term of Conversion Series ‘7 
Coefficient of alienation (1 


ec 
' 


Measure (1) 

Mean (13) 

Median (1) 

Midscore (1) 

Number of cases (15) 

Origin (4) 

- Probability frequency (3) 

Intersection of ordinate and abscissa (4) 

Sum of Partial Scores in Contingency (1) 

Composite Price (1) 

Percentile (2) 

Probability frequency (3) 

Intersection of ordinate and abscissa (2) 

Quartile Deviation (14) 

Rank (3) 

a) Footrule Coefficient of Correlation 
ll 

Spearman rank difference of correlation (1) 

Reliability coefficient of large group (2) 

Multiple correlation coefficient fe) 

a from fluctuation of sam- 
ple 

Correlation of sampling errors 

Sum (1) 

Measure (1) 

Sum of Sums (1) 

Sum of Partial Sums (1) 

Frequency up to interval (2) 

Standard Error of Estimate (1) 

Uncorrected standard deviation (4) 

Standard deviation of converted series (2) 

Number in each cell + independence value (4 

Standard Deviation of broadly grouped data 

Frequency Total (1) 

Time (1) 

Tails in Binomial (3) 

Per cent of Unlike Signs (2) 

Coefficient of Variability (6) 

Correlation Table (1) 

Scale of diagonal Values in correlation 
table (1) 

W - Weight (1) 

Measure (6) 

Unit deviation from Mean (1) 

Predicted Score (3) 

Score on converted series (2) 

Predicted Score (4) 


1) 


Cc 
' 


vad 
' 


Measure (6) 
a - Number of cases in ++ quadrant s 
© a¢ U] 2 
Regression coefficient (5) 
(1) 
Number of cases in +- quadrant (2 
oJ f f * 4+ f 


Mode (2) 
4 
b - Number of cases in +- quadrant 
n " " 
Constant of regression coefficient (intervals) 
Correction for Mean in intervals (6) 





»o 


oe Dou 


@c 




Correction for Mean deviation from Mean or 
Median (4) 
Weighted composite score (1) 
Contingency coefficient (1) 
Correction for Mean in units (4) 
Number of cases in -- quadrant (3) 
Deviation from Mean (units) (5) 
" " Assumed Mean Units (5) 
" " " " Intervals (9) 
Difference between ranks (1) 
Deviation from Mean intervals (1) 
Interval deviation from Assumed Mean (1) 
frequency (1) 
frequency in ungrouped data (1) 
frequency in interval (3) 
gains in ranks (6) 
units in interval {2} 
units in interval (2 


number of classes or intervals (2) 
constant in regression equation (1) 
any — of subscripts other than 1 and 


2 (1 
rank (1) 
number of units in interval (2) 
coefficient of alienation (4) 


- lower limit of interval (1) 
- measure (5) 


midpoint of interval (1) 
mean (1) 
correction in units (1) 
slope of line (5) 
number of cases, specific (2) 

" " n indeterminate (6) 
criterion (1) 
per cent of cases falling below (3) 
general probability frequencies (3 
Binomial expansion (1) 
price (1) 
general probability frequencies 
binomial expansion (1) 
quantity a). 
coefficient of correlation P.M. 
rate (2) 
Spearman rank difference correlation (1) 
units in interval (1) 
measure (1) 


(3) 


(16) 


date (1) 

number of elements in composite (1) 
time (2) 

number of classes or intervals (2) 
commodity (1) 


number of elements in single test (1) 

upper limit of interval (1) 

ranks (2) 

lower limit of interval (1) 

deviation of diagonal values in correlation 
table (1) 

weight (1) 

measure (2) 

standard score (1) 

deviation from mean units (4) 

deviation from mean intervals (1) 

measure (1) 

deviation-from mean units (4) 

deviation from mean intervals (1) 

standard score (3) 

mean ordinate (1) 

height of ordinate (2) 

difference in means (1) 


Three letters (L, J, j) are not used. Sixteen of the letters have only one assigned meaning: B, I, N, O, U, W, Z among the capital letters and a, e, g, h, i, l, q, u, w among the small letters. The frequencies for these letters are comparatively small.




Six letters (E, G, H, V, Y and n) are each credited with two meanings.

Ten letters (A, F, K, Q, T, f, o, r, v, y) have three varied uses.

The letters C, M, X, b, p, t, x, and z (eight in all) each occur with four meanings.

Four letters (D, P, m, and s) are each credited with five meanings.

Two letters (d and k) have six meanings.

Two letters (R and c) have seven meanings and one (S) has as many as ten uses.


It is evident from these data that some of 
the letters are overworked in the sense that 
they are used as symbols for a great variety 
of terms, while others are used sparingly or 
not at all. There are so many statistical 
terms to be represented that it is indeed 
doubtful if they could be represented in any 
single treatment without some duplication in 
the uses of symbols. Such duplication should 
be reduced to a minimum, however, and as- 
signed meanings should be clearly differen- 
tiated in the case of any letter symbol. It 
is probable that by the proper use of the 
prime symbol, subscripts and letter combina- 
tions, the texts could be made much more con- 
sistent; that the average number of terms 
assigned per symbol could be definitely re- 
duced from the present figure of 2.8; and 
that the tendency to use certain letters to 
the point of confusion could be avoided.
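
The figure of 2.8 can be checked against the tally of letters given above. The sketch below is offered only as a check on the arithmetic; it assumes that the average is taken over all fifty-two capital and small letters, the three unused letters included, which is the reading under which the reported figure is reproduced. The variable names are illustrative.

```python
# Number of letters credited with each number of meanings, as tallied above:
# 3 unused, 16 with one meaning, 6 with two, 10 with three, 8 with four,
# 4 with five, 2 with six, 2 with seven, and 1 with ten.
letters_by_meaning_count = {0: 3, 1: 16, 2: 6, 3: 10, 4: 8, 5: 4, 6: 2, 7: 2, 10: 1}

total_letters = sum(letters_by_meaning_count.values())
total_meanings = sum(k * count for k, count in letters_by_meaning_count.items())

# Average over the whole alphabet (capital and small letters together).
average_all_letters = total_meanings / total_letters

# Average over only those letters that are actually used as symbols.
letters_in_use = total_letters - letters_by_meaning_count[0]
average_used_letters = total_meanings / letters_in_use
```

The tally accounts for all 52 letters and 146 assigned meanings, so the average over the whole alphabet is about 2.8; taken over the 49 letters actually in use it would be nearer 3.0.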

Except for Σ, used rather consistently for
"sum" or "summation," ρ for Spearman Rank
Difference Coefficient of Correlation, and σ
for standard deviation, the Greek alphabet is
not of frequent occurrence. Only in a few
instances do inconsistencies occur, as in
the case of η used for correlation ratio in
general, but given the meaning of bi-serial
correlation coefficient by one author; of & 
which occurs as the notation for coefficient 
of linearity as well as for the difference 
between the mean and the assumed mean, and 
of C which is used both as the symbol for co- 
efficient of linearity and for the interval 
deviation from the assumed mean.

The difficulty which the beginner experiences when he undertakes extensive reading is perhaps better portrayed in List II, in which the frequency with which various symbols are used to express the same basic meaning is noted. Not all of the texts introduce all of the concepts listed. In fact, certain ones are treated by only a few texts. The frequency totals more than 16 (the number of texts analyzed) in some cases because of the practice of some authors of using two or more symbols to express the same concept. As previously noted, some texts avoid the use of symbols and denote the term verbally.


LIST II 


Variations of Notation for Certain Terms as 
Found in Sixteen Textbooks in Educational 
Statistics, With Their Frequency of Occurrence 


> M (5), x (3), § (1), 
et verbal (3). 
specific number of cases ... N (15), n (2) 
(14), S (2), T (2) 

eee eof (15), F (8) 
cumulative frequency...f, ‘2 » Cum. F. {2} Cum. 

f { ; cum. f. (1), ver- 
3 


(2), I {3} 
2), int. (1 
» ed (1), CI 
verbal (6) 
» verbal (9) 


(33? 


{i 


measure 


bal 
number of units in interval....h.k. 


: » Ge. (2), GA 
1), Ma ye Ass. M. 

Est. M (1), M 

(1), Arbitrary 


y"Briein (1), 
verbal (1) 


Ae D (2), d (4), x (1), 
mean (in intervals)..d (9), d' (1), x (1), x? 
(1), D (1) 


deviaticn from mean 
(in units). 


assumed mean (in ‘ 
senencesec onset (1), X (1), 4 (3), € 


assumed mean (in 
intervals) .......++.d i, BG) e (1), ¢€ (1), 


x (1 
correction (in units)..C e (4), m (1) ci (1), 


correction (snvertals).c (7); ce’ (1), me (1), A 
Md (9), Mdn (4), ued (2) 


- Md. (2), Me »u 
2 (1), verbal 

lower limit of iineeealae ocoukee. Cate & : » LL. 

1), v (1), verbal 


upper limit of interval.....uf (1), u (1), U.L. 
1), v' (1), verbal 
8 


ate. (3) G (1) 
te); ib. Ne) 
-D. (1), verbal 
(15), U.Q. (1), Pr 
“PHS, 
standard error of oemen whit? Z ce 3° (1), 


standard error of oni a 
ard deviation........0, (3), &, (1), 
Sdeviation in ¢ (1) 


geometric BORRe a00eoeell Psy 
harmonic BOGMc ccccccoceel 

mean deviation.....+++A be 
quartile deviation.....Q (14), Q 


upper quartile.......+-Q 
number of measures below....Np (5) ° 
F 










List II (Continued) 
probable error of difference 
1M MEANS. .crccccessevessees P.E. air. (5), 

P.E. (aire) (1), P.E.p 
(1), P-E.yy - iy (1), 
P.E. diff. my - my (1 

difference in ranks.........D (9), 4 (1), y- vy 

(1), vi- ve (1) 
o000G £833 g (6) 
, 


gains in ranks 
16 R (1) 


coefficient of correlation..r 
Spearman rank--difference 
COOfficient.cccccccccccecce e(il), r (2), 
sums of product moments Oxy 1») a 
, 
dy Dfxy dy (1) 
XK 63} n (1) 
1), Cor lL 


) 


ti 
—" 


number of classes... {3} 
coefficient of wenden % as 


standard error of estimate..S (2), Gest. (2), 
o (est.) (1), 9.2 (1), 
Oyx (1), SEE (1) 
standard error of measure- 
paceecoscocoeseces o+00D peas. (2), Fe (1), 
Fest. + obt. (1), om (1), 
Sl.@ (1) 
coefficient of attenuation..rge (1), rap (1), Puy 
(1), Too (1), : 
l xy corrected for attenuation (1) 


predicted correlation for 


test n times as long eTnan (1), ru (2), rn 


(2), Tea (1) 
coefficient of multiple cor- 
relation ceeeeeeeeeseeRy.os (4), Ry (23) (1), 
Pi. 28 (1), Re-iz2 (1), 


Pc. yp (1) 
constant in a regression 
<< {3}. e (1), K (1), 


In examining the items of List II one is struck by the great variety of symbolism applied to the more common terms, such as "measure" with six variates, "number of units in the interval" with seven variates, "mean average" with thirteen rather distinct variates, "assumed mean" with ten, and "median average" with seven. In the case of the mean the symbol T. M., the notation AV(obt.) or M(actual), all indicating True Mean, are used chiefly in connection with the process of finding the Mean Deviation. No clear reason for such deviations of terminology or notation can be given, and they doubtless confuse rather than aid the learner. The same is true of T. Md. standing for True Median.

In the matter of indicating deviations from the mean there is the greatest confusion. Unless the reader analyzes the text in detail he can hardly be certain which type of deviation is indicated in a particular situation. Those notations having to do with standard error indicate uncertainty on the part of the authors who include such computations. In the majority of cases the small sigma with a subscript is used but there is a good deal of variation in the subscript. In other cases some other symbol than small sigma is used. The small sigma with a subscript capital M is used by one author as the symbol for the standard error of the mean and by another as the standard error of measurement.
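
The double use of sigma with subscript M is more than a typographical matter, since the two quantities are computed differently and answer different questions. The sketch below uses the conventional formulas, which are supplied here as assumptions and are not quoted from any of the sixteen texts: the standard error of the mean taken as σ/√N, and the standard error of measurement taken as σ√(1 - r), where r is the reliability coefficient of the test. The numerical values are hypothetical.

```python
from math import sqrt


def standard_error_of_mean(sd, n_cases):
    """Sampling error of a mean: depends on the number of cases."""
    return sd / sqrt(n_cases)


def standard_error_of_measurement(sd, reliability):
    """Error in a single obtained score: depends on test reliability."""
    return sd * sqrt(1.0 - reliability)


# Hypothetical test: standard deviation 10, 100 cases, reliability .91.
sd, n_cases, reliability = 10.0, 100, 0.91

se_mean = standard_error_of_mean(sd, n_cases)
se_measurement = standard_error_of_measurement(sd, reliability)
```

With these figures the first quantity is 1 and the second is about 3, so a reader who mistakes one author's symbol for the other's would be in error by a factor of three.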

It is evident that the notation for some of the terms is becoming somewhat standardized in the sense that a certain symbol is used in all or at least a majority of the texts. Such is the case with N for "number of cases in a specific series"; Σ for "sum"; f for "frequency"; r for "coefficient of correlation by the product-moment method"; Q for "quartile deviation"; Q₃ for "upper quartile point"; and y₀ for the "mean ordinate of the normal curve." The most frequent usage cannot be taken to mean ideal usage, however. For example, if M is eventually selected as the symbol for the mean average, M.D. would appear to be the logical notation for mean deviation, even though the majority use A.D.

The following quotation from Otis indicates, not merely the difficulty which an author meets in selecting a system of notation but also the factors which influence him in the selection. "The term average is considered by certain authorities as referring to any measure of central tendency, while the term mean refers to what is commonly known as the average. In this case, however, it is convenient to use the term average deviation, since the abbreviation Avg. Dev. (sometimes written A.D.) is not likely to be confused with Med. Dev. (sometimes written M.D.). Since M.D. might be understood as standing for mean deviation it is safer to write Med. Dev."1 In the face of such confusion it is no wonder that some authors adhere to the full verbal form.

1. A. S. Otis, Statistical Method in Educational Measurement (Yonkers-on-Hudson, New York: World Book Company, 1925), footnote p. 90.

In List III formulae for mean and median are given. These illustrate rather vividly the difficulty which the beginner encounters in referring to other texts, even in the simpler and more common computations. Even though the formulae agree in general form and in certain specific symbols, they differ markedly in other respects. No two agree in toto. The awkwardness of the formula for the median as expressed in verbal terms is apparent. It may be noted that those who resort to full verbal explanation list the steps of procedure without assembling these in a formula.
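
Whatever the notation, the formulae of List III describe three computations, which may be sketched as follows. The long method takes the sum of frequency times midpoint divided by the number of cases; the short method adds to an assumed mean a correction found from interval deviations; the median is found by interpolation within the interval containing the middle case. The function names, argument names, and the small frequency table are illustrative assumptions and follow no one text's symbolism.

```python
def grouped_mean_long(midpoints, freqs):
    """Long method: sum of (frequency x midpoint) divided by number of cases."""
    n = sum(freqs)
    return sum(f * x for f, x in zip(freqs, midpoints)) / n


def grouped_mean_short(midpoints, freqs, assumed_mean, interval):
    """Short method: assumed mean plus correction from interval deviations."""
    n = sum(freqs)
    sum_fd = sum(f * (x - assumed_mean) / interval
                 for f, x in zip(freqs, midpoints))
    return assumed_mean + interval * sum_fd / n


def grouped_median(lower_limits, freqs, interval):
    """Lower limit of the median interval plus the interpolated part of it."""
    half = sum(freqs) / 2
    below = 0  # cases falling below the current interval
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and below + f >= half:
            return lower + (half - below) / f * interval
        below += f
    raise ValueError("no cases in the distribution")


# Hypothetical distribution: intervals 10-19, 20-29, ..., 50-59 (width 10).
lower_limits = [10, 20, 30, 40, 50]
midpoints = [15, 25, 35, 45, 55]
freqs = [2, 5, 8, 4, 1]

mean_long = grouped_mean_long(midpoints, freqs)
mean_short = grouped_mean_short(midpoints, freqs, assumed_mean=35, interval=10)
median = grouped_median(lower_limits, freqs, interval=10)
```

Both methods give the same mean for this table, which is the point of the short method: it changes the labor, not the result.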


LIST III 
Variations in Formulae, With Frequency 


The Mean Average - grouped data 
Long Method: 
rrx 


cee, a); FS 0) 


rf: 
verbal (4); _ (4); N 


Ufx 

=, (2) 
n 
Short Method: 


Est. M +2fd , int. 


(1); 
ox int, (1); 


verbal (4); A + Fn, (1); 


ufda 


rf 
GM + y? (2s AM + 


XD(FD) (algebraic) 2 hesth of 
N step, (1); 





GA + 


GA + (c.4) (ES), (1); GM +Efedr, (1); 


ILfE 
Mm, + n , (1); 


Dra (alg.) . 
M (approx.) * —— a x s, (1); 
(Class interval)LE 

N 


» (1); 





Arbit. Orig. + 


estimated mean oe x number of units 


in the inter- 
val (1). 


The Median 


verbal, (8) 


i, (1) value at the lower 


a 
edge of the interval + I> (1) 


Lower limit of middle class + [(N/2 - Partial sum) / Class frequency] x Class int., (1)

Lowest point in interval + (No. of cases to median / No. of cases in interval) x interval, (1)

bottom of interval + (cases to go / cases in interval) x size of int., (1)

bottom of mid-int. + (f yet to be used / f of mid-interval) x size of int., (1)

It has been frequently noted that the next step to be taken in educational research has to do with the refinement of technique and procedures. Statistical methods are destined to play such a large part in the educational research of the future that it would appear logically essential to determine the exact language which statistics will use. It is difficult enough in any case to initiate professional bodies of educators and specialized research workers into the mysteries of statistics, but the process is increased greatly in difficulty when no two texts or authorities in the field speak the same language. The present status is somewhat akin to that which would exist if a student of shorthand tried to learn the Gregg System by consulting and studying a large number of texts, each of which has been produced by an author who has made his own extensive modifications in the basic word signs. There would be manifestly little community of understanding among those who had studied single treatments which were diverse, and the utmost confusion in the art of stenography would obtain.

On the basis of the considerations here given it is urged that steps be taken in the near future toward the standardization of the terminology, symbols and formulae in educational statistics. It may be argued that adequate standardization will take place normally through the process of natural selection. This may be true if we are willing to wait and in the meantime endure the trial and error process that now obtains. The Joint Committee on Standards for Graphic Presentation performed a valued service in their day for the entire statistical field. A similar committee should function in establishing standard nomenclature and notation. Some may profess to fear the inhibiting effects of standardization upon the relatively new field of educational statistics, thus preventing creative activity and invention in the future. It appears that there has been too much free invention already, insofar as true pedagogical interests are concerned. No committee should take upon itself the task of determining the specific topics which should be included in any treatment, nor the detailed ways in which the topics should be presented. On the other hand, no author should have the right to further confuse the reader and especially the beginning student with his unique and sometimes peculiar adaptations and devices.








STANDARDIZATION OF STATISTICAL SYMBOLISM1
Walter S. Monroe 
University of Illinois 


The letters and other symbols used in formulae and in descriptive statements are instruments of convenience. Verbal phrases would make most formulae entirely too cumbersome. As in algebra, X and other symbols are used to represent a quantity or even a group of quantities. Since statistical symbols are used as instruments of communication as well as of personal convenience, uniformity of meaning is highly desirable, if not imperative, for those most commonly used. It is also desirable that the symbols conform to a system. Unfortunately, our symbolism has developed without much attention to general principles. A few principles, however, are rather generally observed by the more authoritative writers.


Designation of data: Test scores and other measures. A group of data, such as the scores made on a test, chronological ages, school marks, intelligence quotients, and the like, commonly thought of as representing values of a variable, is usually designated by the symbol, X. A second variable may be designated by Y, but a number of writers prefer to attach numerical subscripts to X to designate the several groups of data or variables. This practice has the advantage of being capable of extension to any number of variables. When only two sets of data are involved the most common designations are X₁ and X₂, but there are certain advantages in using X₀ to designate the variable that is considered criterion or dependent. Thus the independent variables would be represented by X₁, X₂, X₃ ... Xₙ.

When the raw data have been transformed so that they are expressed as deviations from their mean as the zero point small letters are used instead of capitals. If they are further transformed so that they are expressed in terms of their standard deviation (σ) as a unit, z is used as the symbol. When the data are expressed from an arbitrary zero point, such as an assumed mean, a prime (') is attached to the symbol. A bar (¯) above a symbol indicates a value estimated by means of a formula, usually a regression equation. An estimated value usually has the meaning of "most probable" value.
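
The successive transformations named in this convention may be illustrated with a short sketch. It is a minimal illustration only: the function name, the assumed mean, and the sample scores are hypothetical, and the standard deviation is computed with N in the denominator, which is an assumption about the practice intended.

```python
from math import sqrt


def transform_scores(X, assumed_mean):
    """Return M, sigma, and the scores in three derived forms.

    x       : deviations from the mean (small letter)
    z       : deviations in standard-deviation units
    x_prime : deviations from an arbitrary origin (primed symbol)
    """
    N = len(X)
    M = sum(X) / N
    x = [score - M for score in X]
    sigma = sqrt(sum(d * d for d in x) / N)
    z = [d / sigma for d in x]
    x_prime = [score - assumed_mean for score in X]
    return M, sigma, x, z, x_prime


# Hypothetical raw scores, with 4 taken as the assumed mean.
X = [2, 4, 4, 4, 5, 5, 7, 9]
M, sigma, x, z, x_prime = transform_scores(X, assumed_mean=4)
```

For these scores the mean is 5 and the standard deviation 2, so the first score appears as 2 in raw form, -3 as x, -1.5 as z, and -2 as x'.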

The number of measures (cases) in a group or population is commonly designated by N. Small n is used to represent the number of variables or sets of measures. When two groups or populations are involved, N designates the larger and n the smaller one.

The results of calculations. The results of certain calculations such as those designed to obtain a central tendency or a measure of relationship are commonly designated by a single letter: M for mean, r for Pearson product-moment coefficient of correlation, σ for standard deviation, etc. There are a few exceptions of which the use of Md for median and PE for probable error are perhaps the most important.

The use of subscripts. It is usually desirable to connect a symbol designating the result of certain calculations with the group or groups of data from which it was obtained. This is accomplished by means of subscripts. For example, r₁₂ indicates that the coefficient of correlation was obtained from the measures of variables X₁ and X₂.
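
The purpose of the subscript can be shown in a short sketch in which each coefficient is stored under the numbers of the two variables from which it was computed. The product-moment formula is written in deviation form, r = Σx₁x₂ / (Nσ₁σ₂); the function name, the dictionary layout, and the scores are illustrative assumptions.

```python
from math import sqrt


def product_moment_r(X_a, X_b):
    """Pearson r from deviation scores: sum(x_a * x_b) / (N * sigma_a * sigma_b)."""
    N = len(X_a)
    M_a = sum(X_a) / N
    M_b = sum(X_b) / N
    x_a = [v - M_a for v in X_a]
    x_b = [v - M_b for v in X_b]
    sigma_a = sqrt(sum(d * d for d in x_a) / N)
    sigma_b = sqrt(sum(d * d for d in x_b) / N)
    return sum(a * b for a, b in zip(x_a, x_b)) / (N * sigma_a * sigma_b)


# Hypothetical measures on three variables, keyed by their subscripts.
X = {
    1: [1, 2, 3, 4, 5],
    2: [2, 4, 6, 8, 10],
    3: [5, 4, 3, 2, 1],
}

# r[(1, 2)] plays the part of the printed symbol r with subscript 12.
r = {(i, j): product_moment_r(X[i], X[j]) for i in X for j in X if i < j}
```

Because the key records the source of each coefficient, r₁₂ and r₁₃ cannot be confused even when many variables are in play, which is the advantage claimed for numerical subscripts.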

New symbols. When a writer encounters a
need for new symbols he should be guided by 
two general rules: (1) Avoid, if possible, 
the use of a letter or other symbol that has 














1. This article was motivated by the realization that variation in symbolism, even when the author is careful to give the meaning of each symbol used, increases the difficulty of reading the more technical reports of educational research. A preliminary list of the commonly used symbols was compiled and a mimeographed copy sent to the other members of the Editorial Board of the Journal of Experimental Education and to each of the Contributing Editors. In the accompanying letter the question was raised whether in selecting and editing manuscripts for publication in the Journal of Experimental Education we should give attention to the symbolism used by the authors. The replies to this question were unanimously affirmative. In response to the further question, whether a list of symbols for the most frequently occurring quantities or terms should be compiled and conformity to it should be insisted upon, several of







any considerable usage for another purpose. 
(2) Select a simple symbol. In general use 
a single letter, with an appropriate sub- 
script if necessary. When two letters are 
used, omit periods.


A SUGGESTED PARTIAL LIST OF SYMBOLS 


Note. For the most part this list has been
restricted to symbols strongly supported by
current practice. In a few cases a symbol
not widely used is given because it appears
to represent a desirable practice. No attempt
has been made to supply symbols for
all statistics. It is possible that standardization
of symbols for seldom used statistics
is not desirable, but doubtless this
list should be supplemented from time to
time. The author will be glad to receive
criticisms and suggestions.

CR  Critical ratio. $CR = \frac{M_1 - M_2}{\sigma_D}$. See EC.

Correction, difference between assumed mean
and exact mean. As a subscript before $\sigma$
and $r_{12}$, indicates that a correction for coarse
grouping has been employed.

D  Difference. The quantities subtracted may be
indicated by a subscript. For example, $D_{10-90}$
designates the 10-90 percentile range.

Deviation from the mean in terms
of class or step intervals.


Deviation from an assumed or arbi- 
trary origin. 




AA  Achievement age, or synonymously accomplishment age or attainment age.

AD  Average deviation, or mean deviation.

AQ  Achievement quotient.

AR  Achievement ratio.

$b_{12}$  Regression coefficient, involving r (Pearson) unless otherwise specified, of $X_1$ (dependent) on $X_2$ (independent) or $x_1$ on $x_2$. ($b_{12} = r_{12}\frac{\sigma_1}{\sigma_2}$).

$b_{21}$  Regression coefficient of $X_2$ on $X_1$ or $x_2$ on $x_1$. ($b_{21} = r_{12}\frac{\sigma_2}{\sigma_1}$).

$b_{12.34 \ldots n}$  Partial regression coefficient of $X_1$ on $X_2$, all others of n variables constant.

C  Constant in a regression equation. Used as subscript designates control group.

CA  Chronological age.

Coefficient of determination used in connection with path coefficients.

A coefficient of determination measuring the joint effect of variables $x_2$ and $x_3$ on the variance of $X_1$.

E  Used as a subscript designates experimental group. Sometimes used as a symbol for "efficiency of prediction." See I.

e  Designates the base of the Naperian system of logarithms and equals 2.71828... Also used to designate the error in a measure.

$e_{var}$  Variable error of measurement.

$e_{sys}$  Systematic error.

EA  Educational age, expresses standing in a number of school subjects.

EC  Experimental coefficient proposed by McCall, $EC = \frac{M_1 - M_2}{2.78\,\sigma_D}$. See CR.

EQ  Educational quotient.
(Footnote continued)
the respondents brought out the point that, although uniformity was desirable, it probably would not be possible to
formulate a standard list of symbols at this time, and also that in any case it might be wise to permit some degree
of variation in the case of symbols not frequently used.







f  Frequency within an interval.

Gain.


Improvement over pure guesses, or 
efficiency of prediction of fal- 
lible measures of the criterion. 


I, = 100 (1-/Yl- ~) 


P 1 


Improvement over pure guesses, or 
efficiency of prediction of true 
measures of a criterion. 

Inn = 100 (1-Yryy - rip) 


Intelligence quotient. 


Width of class or step interval 
in scale units. 


A constant. (Not used in regres- 
sion equations. ) 


Kurtosis. 

k  Coefficient of alienation.
$k = \sqrt{1 - r^2}$. When squared is the
coefficient of non-determination.
See r for subscripts and their
meaning.
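As a small numerical sketch of this entry, the coefficient of alienation and its square can be computed directly from r. The function names are hypothetical, and the "improvement over pure guesses" line assumes the form 100(1 - k), which is what the corresponding entry in this list appears to give.

```python
import math

def alienation(r):
    """Coefficient of alienation, k = sqrt(1 - r^2)."""
    return math.sqrt(1.0 - r * r)

def non_determination(r):
    """k squared: the coefficient of non-determination."""
    return 1.0 - r * r

def improvement_over_guess(r):
    """Assumed form of the per cent improvement over pure guesses: 100(1 - k)."""
    return 100.0 * (1.0 - alienation(r))

r = 0.6  # hypothetical coefficient of correlation
print(alienation(r), non_determination(r), improvement_over_guess(r))
```

The example makes plain how slowly predictive efficiency rises with r: a correlation of .6 leaves k at about .8.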

Multiple alienation coefficient.
When squared is the multiple coefficient
of non-determination.
Partial alienation coefficient. 
Mean. 

Mental age. 

Median. 

Median deviation. 


Mean deviation. This symbol is
not recommended. Use AD instead.


Mode. Sometimes designated by Z. 


Number of variables when n is
used in same formula.


Total number of cases or observa- 
tions. 





n  Number of variables. Also number
of alternative responses to an item
on a multiple response test.


P  With subscripts from 1 to 100 designates
a percentile point. For
example, $P_{10}$ designates the tenth
percentile, the point on the scale
of a frequency distribution below
which 10 per cent of the measures
fall, and $P_{90}$ designates the ninetieth
percentile, the point on the
scale of a frequency distribution
below which 90 per cent of the
measures fall.


Probable error, median deviation of
the distribution when it is one of
errors. PE = .6745σ where σ represents
the standard error. PE is
sometimes used incorrectly for MdD
where the distribution is not one
of errors.

For the use of this symbol with
subscripts to designate the probable
error of particular quantities
see σ (standard error).


Probability of success, or of per
cent of cases in a given category.
$p = 1 - q$.


Symbol for path coefficient. 


Quartile deviation. Sometimes
called semi-interquartile range.
$Q = \frac{Q_3 - Q_1}{2}$.


First (lower) quartile point.
$Q_1 = P_{25}$.

Third (upper) quartile point.
$Q_3 = P_{75}$.

Probability of failure. $q = 1 - p$.


Multiple correlation coefficient. 
When squared it is the coefficient 
of multiple determination. 


Pearson product-moment coefficient
of correlation in a theoretical
range of talent. Usually this
range of talent is larger than that
for $r_{12}$. R is used as a symbol for
the coefficient of "rank correlation,"
but this statistic is seldom
calculated. R is also used to
designate the "number of right responses."

Pearson product-moment coefficient
of correlation. Subscripts are
used to indicate the two sets of
paired measures or variables whose
correlation is being expressed.
The symbols $X_1$, $X_2$, etc., designating
the variables may be used as
subscripts but usually only the
subscripts of these symbols are
attached to r. Usually no significance
is attached to the order in
which the two subscripts are written,
but when it is desired to identify
the dependent variable
(Y-ordinate in correlation table)
the first position may be given to
the subscript designating this
variable. A few of the more common
cases of special subscripts
are given below.

Coefficient of reliability,


the 


subscripts designating two meas- 
ures of the same thing. 
write r 


We also 


ex* 


Coefficient of correlation between 
the two halves of a test. 


Correlation between obtained and
theoretical true measures. This
symbol assumes that $X_1$ represents
the obtained measures. If another
symbol is used to designate these
measures the subscript "1" would
be changed accordingly. When $X_1$
designates test scores $r_{1\infty}$ is
read "index of reliability."


Coefficient of correlation corrected
for attenuation.


Coefficient of partial correlation. 


Sometimes used to designate summation.
See Σ.





SD  Standard deviation, but this symbol
is seldom used. See σ.

Used as subscript to indicate errors
due to random sampling.

sys  Used as subscript to designate
systematic error.

Sk  Skewness.

Symbol for tetrad difference, which equals
$r_{12}r_{34} - r_{13}r_{24}$.

Coefficient of variability; a
measure of relative dispersion.
Often written C. of V.

Used as subscript to designate variable
error.

X and Y  Are used to designate raw or observed
measures in two series of
measures, or $X_1$ and $X_2$ may be used.
If there are several series of raw
measures these may be designated
$X_1, X_2, X_3, \ldots, X_n$. Small letters
$x$, $y$, $x_1$, $x_2$, $x_3$ and so on represent
measures expressed as deviations
from the means of the corresponding
raw measures. (See introductory
statement.)





Theoretically true measures of the
variables X and Y, the symbol ∞ indicating
that they represent averages
of infinite numbers of measurements
of X and Y. Small letters
x and y represent true measures
expressed as deviations from
their respective means.


Symbol for true criterion measure. 


Measures expressed in terms of the 
standard deviation as a unit. 


Standard measure, i.e., a measure
expressed from the mean of the
distribution as a zero point and
in terms of the standard deviation
(σ) as a unit. $z = \frac{X - M}{\sigma}$. The
symbol z is also used as the
ordinate of the normal probability
curve having unit area and unit
standard deviation.


Eta, the ratio of correlation; 
measures curvilinear correlation. 


Summation, the sum of. Occasionally
S is used for this purpose
but except where special designation
is necessary, Σ is preferable
as a symbol. S is sometimes used
to indicate standard deviation of
theoretical or large range.


The sum of the measures for indi- 
viduals, 1 to N inclusive 


Sigma, the standard deviation of
a distribution. $\sigma = \sqrt{\frac{\Sigma x^2}{N}}$. When
the distribution is one of errors
σ is called the standard error.
The particular distribution is indicated
by subscripts. A few special
cases are given.


Standard error of a mean, usually
interpreted as the standard error
due to random sampling. When
there is any possibility for confusion,
sub-subscripts should be
used.


Standard error of a mean due to
random sampling and to variable
errors of measurement.


Standard error of a mean due to 
variable errors of measurement. 


Standard error of a mean due to 
random sampling. 


Standard error of a median. 


Standard error of a coefficient of 
correlation. 


Standard error of a difference.
When desired the subscript D may
be replaced by the difference it
represents. Hence $\sigma_{M_1 - M_2}$ would
indicate the standard error of the










oo *254...0 




difference $M_1 - M_2$. When necessary
to avoid confusion $r_{M_1 M_2} \neq 0$ or
$r_{M_1 M_2} = 0$ may be added to the subscript
to make clear the formula
used.


True standard deviation. 
=O, 7% 


Standard error of estimate: standard
deviation of the differences
between $X_1$ and the estimates of
$X_1$ ($\bar{X}_1$) made from $X_2$ by simple regression
equation, $\bar{X}_1 = r_{12}\frac{\sigma_1}{\sigma_2}X_2 + C$.
Hence we might write $\sigma_{X_1 - \bar{X}_1}$
but this would not be a convenient
symbolism. Hence we use $\sigma_{1.2}$, in
which 2 indicates the basis of estimate
and 1 the basis of comparison.

Standard error of estimate of $X_2$,
the predictions being made from $X_1$
by means of regression equation.
$\sigma_{2.1} = \sigma_2\sqrt{1 - r_{12}^2}$.


Standard error of measurement where
$x_1$ is taken as evidence of $x_\infty$, or
$X_1$ is taken as evidence of $X_\infty$. Literally
the standard error of measurement
is the standard deviation
of the difference $x_1 - x_\infty$ or
$X_1 - X_\infty$ (variable error of measurement).
$\sigma_{1.\infty} = \sigma_1\sqrt{1 - r_{11}}$.


Standard error of estimate where a 
regressed, or estimated true score, 
X,, is taken as evidence of the 
true score X,. (X,, Xs, X» repre= 
sent the same function). Ga.1 


“-n ~ "Bs 


ol 


Standard error of estimate of $X_1$
computed from regression equation
involving independent variables $X_2$,
$X_3$, $X_4$, ... $X_n$. Same symbol is used
when the variables are expressed
from their respective means as zero
points.


Standard error of estimate where X, 
is taken as evidence of true cri- 
terion measure X,, multiple 







regression equation used with $X_2$,
$X_3$, $X_4$, ... $X_n$ as independent variables.


Standard error of a quotient such 
as AQ. 


Infinite symbol, as a subscript a
true measure of a variable, i.e.,
the mean of an infinite number of
measurements.


Omega, as a subscript, a true meas- 
ure of a second variable. 














SOME CONSIDERATIONS RELATIVE TO THE STANDARDIZATION 
OF CERTAIN PROCEDURES IN EDUCATIONAL RESEARCH 


Herbert A. Toops 


A speaker at an educational conference 
recently stated that fully 95 per cent of the 
researches in one branch of educational psy- 
chology should be discarded because the ex- 
periments were based on so few cases and the 
tests used were so unreliable that the re- 
sults are statistically worthless. The obvi- 
ous remedy is more subjects and longer tests 
or more kinds of tests per subject. When in- 
vestigations are thus enlarged, machine tab- 
ulation of data becomes imperative. Machines 
for punched card sorting and for computing 
statistics have come to stay. They are now 
a fixed part of our research tradition. Almost
a score of colleges possess such machines,
and they are to be found in a few
public school systems and research bureaus.

The employment of these machines raises a
whole host of problems which hitherto have
received inadequate treatment in the educational
literature. The machines act blindly,
mechanically, and automatically. Unlike the
tally marker, they are unable to settle each
exceptional case and each issue as it arises.
Every contingency must be foreseen and
provided for. Inevitably, then, the machines
force one to standardize all his procedures:
to have a uniform rule, for example, for
treating all those exceptions which, by hand
treatment, would be settled and then resettled,
perhaps several times, in the course of
a research. To the machines every research
job is a routine job; no job is "new." Both
quantitative and qualitative data, perforce,
must be treated exactly alike. This requirement
forces upon us some efficiencies of
method which it would be well to use even if the
machines had never been invented.

In the process of producing Hollerith 
cards all the data after punching must be 
verified by running the cards through a 
verifier; and even before the cards can be 


Ohio State University 








punched, ordinarily each questionnaire reply
must be transmuted into a quantitative score
or code number. To facilitate punching and
verifying, this number should have been precoded
and printed on the original questionnaire
or data-gathering medium. Such a form
of question is the following one regarding
marital condition:
"Marital condition; check (X)
    Single     ( )  (1)
    Married    ( )  (2)
    Widowed    ( )  (3)
    Separated  ( )  (4)
    Divorced   ( )  (5)"
Such codes are made up arbitrarily, due con- 
sideration during their construction being 
given to the conditions under which the answers
are given as well as to the convenience
of the operator in punching the coded scores 
corresponding to the "vote" recorded by the 
respondent. It should be obvious that what 
is a necessity in machine statistics will 
effect relatively an even greater saving when 
hand tabulation is employed. 
In punching cards for quantitative data,
a system of coding is frequently necessary.
When this has been done the computations deal
with small positive integers or step scores
(values of X'), in place of the more unwieldy raw
scores. By introducing the concept of negative
class interval we are able to handle
statistically all possible cases of coded
scores used to stand for or to replace original
recorded scores. The required formula
is:

    $X_F = (F_1 - F_0)X' + F_0$        (1)

where X' is the coded score corresponding
to the face value, $X_F$, of any class
in such a table as the following:





1. A revision of a paper delivered before the Educational Research Association, Detroit, February 21, 1931.

















    Class    X'
    40-44    4
    35-39    3
    30-34    2
    25-29    1
    20-24    0

$F_1$ is the face value of the
class which corresponds with step 1; namely,
27, in the above table.

$F_0$ is the face value of the class
which corresponds with step 0; namely, 22,
in the above table.

For the above table, equation (1) accordingly
becomes

    $X_F = (27 - 22)X' + 22$

    $X_F = 5X' + 22$        (2)
Now, having the equation, we may take the
face value of any class at all and note that
the relationship holds. For example, where
X' = 4, the value of (5X' + 22) is 42, which
agrees exactly with the face value of the
class 40-44. For coding quantitative scores
for card punching we need to know the X'
corresponding to any given X, if possible
without the necessity of looking at such a
table as we have presented above.
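Equation (1) can be checked against every class of the table at once. The sketch below is only illustrative; the function name and its default arguments (the 22 and 27 of the table above) are assumptions made for the example.

```python
def face_value(step, f0=22, f1=27):
    """Equation (1): face value of the class coded as `step`.

    f0 and f1 are the face values of the classes coded 0 and 1.
    """
    return (f1 - f0) * step + f0

# Reproduce the face values of the five classes 20-24 ... 40-44.
for step in range(5):
    print(step, face_value(step))
```

The difference f1 - f0 is the class interval, so a table whose steps run in the opposite direction simply yields a negative multiplier.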

For this purpose, it is desirable to begin
the recorded score of each class with a
multiple of the class interval, as has been
done above. By solving formula (2) literally
for X', we have as a result the mental
transmutation formula,

    $X' = \frac{X_F - 22}{5}$

which may be written as¹

    $X' = \left\| \frac{X}{5} \right\| - 4$        (3)

where the pair of vertical double parallels
means "throw away any decimals resulting from
the division."
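The double-parallel operation corresponds to integer (floor) division, which for non-negative scores is the same as dropping the decimals. A minimal sketch, with a hypothetical function name and the interval of 5 and offset of 4 taken from formula (3):

```python
def transmute(score, interval=5, offset=4):
    """Formula (3): divide by the interval, drop decimals, subtract the offset.

    Assumes non-negative integer raw scores, so floor division
    matches "throw away any decimals".
    """
    return score // interval - offset

# Every raw score within one class receives the same step.
for raw in (20, 24, 25, 34, 40, 44):
    print(raw, transmute(raw))
```

Because each class begins on a multiple of 5, all five scores of a class share one integral quotient and therefore one coded step.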







By beginning each class of such a transmutation
table with a multiple of the class
interval it will be observed that the transmutations
readily may be done mentally. For
example, the raw score 34 divided by 5 = 6
(decimals being dropped), minus 4 = a transmuted
score of 2. The rule for mental transmutation
is simply: "Divide the score in
question by 5; drop all decimals; subtract
4 from the result; the remainder is the corresponding
transmuted score." One can run
down a column of data, transmuting as he
goes, about as fast as he can record the
transmuted scores in red ink directly above
and to the right of the original data. It is
an advantage to record the original data
systematically in columns on paper, leaving
alternate columns blank to receive the transmuted
scores.

With such transmuting equations available,
results of computations performed on the
steps are readily changed over to read in
terms of the original X-units.

The only problem in this connection arises
with missing data. On Hollerith equipment
it has been found that, generally speaking,
missing data should be punched into position
0 of the card (letting a true zero become
a coded 1), thus requiring a definite category,
"missing data," to be recorded for each
case of a missing score.
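The missing-data convention amounts to shifting every real step up by one so that position 0 is free. A minimal sketch, in which the function name and the use of `None` for a missing score are assumptions for illustration:

```python
def punch_position(step):
    """Position to punch: 0 for missing data, otherwise the step plus one.

    `None` stands for a missing score (an assumption of this sketch).
    """
    return 0 if step is None else step + 1

column = [0, 3, None, 4, None, 1]          # hypothetical coded steps
print([punch_position(s) for s in column])
```

Reserving a definite position for "missing" keeps a blank column from being mistaken for a true zero during sorting.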

For a transmutation code such as the following,

    Class    X''
    40-44    0
    35-39    1
    30-34    2
    25-29    3
    20-24    4

the mental transmutation formula becomes
(since the class interval here is negative;
-5),

    $X'' = \frac{X_L - 40}{-5}$

or

    $X'' = 8 - \left\| \frac{X}{5} \right\|$        (4)





1. $X_F$ may, in this case, be represented as $X_F = (X_L + 2)$, where $X_L$ = face value of lowest recorded score of the class.
Accordingly we may write

    $X' = \frac{(X_L + 2) - 22}{5} = \frac{X_L - 20}{5} = \frac{X_L}{5} - 4 = \left\| \frac{X}{5} \right\| - 4$        (3)

Now since all the scores 21, 22, 23, 24 yield transmuted scores of zero, the device of the parallel bars provides for
obtaining the same integral quotient from all these scores. From formula (3) it will now be readily apparent why the
lowest recorded score of each class should be an exact multiple of the class interval.










Or, mentally, "Divide the score by 5; dis- 
card all decimals; subtract the remaining 
quotient from 8; the remainder is the desired 
coded score." 
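The descending code can be sketched in the same way as the ascending one. The function name is hypothetical; the constants 5 and 8 are those of the mental rule just quoted.

```python
def transmute_descending(score, interval=5, top=8):
    """Formula (4): subtract the integral quotient from 8.

    Assumes non-negative integer raw scores.
    """
    return top - score // interval

for raw in (20, 24, 25, 34, 40, 44):
    print(raw, transmute_descending(raw))
```

Here the highest class receives step 0 and the lowest step 4, which is what a negative class interval of -5 amounts to.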

Let us now consider the relation of X' to X''.

    Class    X'    X''
    40-44    4     0
    35-39    3     1
    30-34    2     2
    25-29    1     3
    20-24    0     4

By an analogous process it is possible to determine
that the relationship of the X'' series
of transmuted scores to the X' series of
transmuted scores is

    $X'' = -X' + 4$

whence checking formulae are readily derived.
Computations done by one set of the transmuted
scores may readily be checked for
arithmetical consistency or accuracy with a
second set of computations made by employing
the other set of coded scores, and vice versa.
Two such checking equations are:

    $\Sigma X'' = -\Sigma X' + 4N$        (5)
    $\Sigma (X'')^2 = \Sigma (X')^2 - 8\,\Sigma X' + 16N$        (6)

where = means "must equal" or "must check
with."
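The two checking equations can be tried on a small set of scores coded both ways. This is a sketch only: the function name and the sample raw scores are hypothetical, and the constants 4, 8 and 16 are those of equations (5) and (6).

```python
def check_coded_sums(x_prime, x_double):
    """Return (eq5_holds, eq6_holds) for two codings of the same cases."""
    n = len(x_prime)
    sum_p = sum(x_prime)
    eq5 = sum(x_double) == -sum_p + 4 * n
    eq6 = (sum(v * v for v in x_double)
           == sum(v * v for v in x_prime) - 8 * sum_p + 16 * n)
    return eq5, eq6

raw = [21, 34, 40, 27, 38, 22]            # hypothetical raw scores
x_prime = [s // 5 - 4 for s in raw]       # ascending code, formula (3)
x_double = [8 - s // 5 for s in raw]      # descending code, formula (4)
print(check_coded_sums(x_prime, x_double))

# A single mis-coded entry makes both checks fail.
x_double_bad = [3] + x_double[1:]
print(check_coded_sums(x_prime, x_double_bad))
```

Equation (6) follows from squaring X'' = 4 - X' and summing over the N cases, so it tests the sums of squares that enter a standard deviation.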

One of the first problems which one encounters
in utilizing tabulating machines,
then, is that of ascribing to each individual
of our research a unique serial number. The
problem, in its general case, is roughly
this: Given, say, 50,000 people whom one
must identify absolutely by a serial number
(necessary for follow-up) in order to be sure
that the later data are credited to exactly
the same individual as in the original, how
can this serial number be given at the time
of application, interview, enrollment or
otherwise to such an individual so as to meet
the following requirements:

1. So that the probability of assigning a
given serial number a second time to any
other individual will be as near 0 as possible,
even though interviews, or applications,
or enrollments, may be going on in several
cities or centers at the same time.











2. So that the serial number shall be
(preferably) a meaningful number rather than
a mere number without sense. The ends to be
secured here are: (a) possibility of the
subject restoring the serial number, if lost,
by means of his own answers to certain questions,
and (b) the effecting of a saving of
columns of the card, thereby leaving more
space for other variables.

3. So that the series shall be indefinitely
expansible.

The first of the above requirements is 
secured in some states' annual distribution
of auto tags, by allotting a certain set of 
numbers to each locality or distributing 
center. If that entire series of numbers 
is not distributed, there will be some va- 
cancies in the series; and, on the other 
hand, if there are more demands than tags 
originally supplied, the locality will have, 
after new tags are supplied, two or more dif- 
ferent series of numbers, with intervening 
numbers absent. The second requirement has 
met with no widespread attempt at solution 
known to the author; while the third is the 
ever present problem of every librarian and 
file clerk. 

It would seem that the problem is susceptible
to statistical treatment because of
the fact that if we take two or more two-digited,
and preferably uncorrelated, variables,
a great many persons can be represented
by the code number resulting from writing
as a continued string-on series the several
scores of a given person in the two or
more specified traits in order. For example,
if we take three traits, absolutely uncorrelated,
each measured in centile ranks of
the group in question, we should have one
million "different kinds" of people. The
serial number 682527 would identify a person
who was 68 centile in trait 1; 25 in trait
2; and 27 in trait 3. In any small finite
group, such as 40,000, the probability that
a given person's serial number is duplicated
would be roughly only .04; in other words,
the chances are about 25 to 1 against any one
person's number being shared by another of
the 40,000. In any event we
might allow one extra column for such a contingency,
as is done in the Dewey decimal
system. The Hollerith card makes adequate
provision for individualizing 10 possible
duplications. If the variables entering into
such serial numbers could be standardized, a
great gain for efficiency of research would
result. We might well consider whether these
serial numbers should not be composed of those variables
which it is of greatest importance to keep
constant in educational research if we are
to obtain an approximation to the condition
of "the single variable": sex, race, family,
maturity, etc. At the present time, probably
no one would care to say authoritatively
what these variables are.
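The serial-number scheme and the figure of roughly .04 can be illustrated numerically. The sketch below is hypothetical in its names and rests on a strong assumption: the three traits are independent and every one of the 1,000,000 codes is equally likely. Under that assumption, .04 is the chance that one given person shares a number with somebody else; the chance that at least one duplicate occurs somewhere in the whole group is a separate quantity, estimated here with the usual birthday-problem approximation, and it is far larger.

```python
import math

def serial_number(c1, c2, c3):
    """String three two-digit centile scores together, e.g. 68, 25, 27."""
    for c in (c1, c2, c3):
        if not 0 <= c <= 99:
            raise ValueError("each score must fit in two digits")
    return f"{c1:02d}{c2:02d}{c3:02d}"

def p_person_shares_number(n_people, n_codes):
    """Chance that one given person's code is held by at least one other."""
    return 1.0 - (1.0 - 1.0 / n_codes) ** (n_people - 1)

def p_some_duplicate(n_people, n_codes):
    """Approximate chance that any two people in the group share a code."""
    return 1.0 - math.exp(-n_people * (n_people - 1) / (2.0 * n_codes))

print(serial_number(68, 25, 27))
print(p_person_shares_number(40_000, 1_000_000))
print(p_some_duplicate(40_000, 1_000_000))
```

On these assumptions some duplicates are practically certain in a group of 40,000, which is a further argument for the extra individualizing column mentioned in the text.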

Let us now proceed to another serious
problem which research has to contend with,
namely, the lack of arithmetical accuracy in
published research reports. An appreciable
proportion of research reports fail to give
evidence of the simple computational accuracy
which may reasonably be expected of such
work. In the first place, this problem can
be met by means of checking formulae, such
as are referred to above. In the second
place, it can be met by the expedient of having
the research worker publish his original
data in order that others, desirous of so
doing, not only may hold the author responsible
for arithmetical accuracy, but also
may themselves check the accuracy of his
computation. Any Ph. D. thesis or dissertation
might, as a matter of custom, be required
to include the publication of the
original data. At the present time, however,
such a prescription would be rather too expensive.
It is therefore suggested that a
series of single-digited linotype matrices
for the numbers 01 to 20 inclusive be developed
by means of which the original data
can be published in single-digited coded
scores derived analogously to formula (3),
page 230. It will be noted that an original
transmutation table can be fully represented
by publishing the class interval and $F_0$. By
means of these two constants anyone readily
may reconstruct such a transmutation table
in full.
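How two published constants suffice to rebuild a table can be sketched as follows. The function name is hypothetical, and the sketch assumes integer scores, an odd class width, and a face value lying at the middle of each class, as in the tables shown earlier.

```python
def transmutation_table(interval, f0, n_classes):
    """Rebuild (lowest score, highest score, step) rows from two constants.

    interval: class interval (negative when steps run downward)
    f0:       face value of the class coded as step 0
    """
    half = (abs(interval) - 1) // 2   # assumes an odd class width
    rows = []
    for step in range(n_classes):
        face = f0 + interval * step
        rows.append((face - half, face + half, step))
    return rows

for row in transmutation_table(5, 22, 5):      # ascending code
    print(row)
for row in transmutation_table(-5, 42, 5):     # descending code
    print(row)
```

A reader given only the interval and the face value of step 0 can thus decode a page of published single-digit scores back into class limits.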

By means of such single-digited type,
original data might be published some 20 to
30 columns to a page and some 50 cases deep.
It would seem not unreasonable, therefore, to
require original contributors to print their
original data in this coded form. Such




printing would have the advantage not only
of making the author responsible for his computed
results, and of enabling any other investigator
to check his work; but also it
would enable other research workers to subject
the original raw data to new, different,
revised, or improved techniques. At the
present time, not infrequently when a revised
technique is devised and the original
author is approached with respect to cooperation
in trying out the new formula on the
old data, it will be found that he has misplaced
the original data. The type of publication
suggested would make such losses
impossible. We will suppose that in the interest
of reducing expense it might be deemed
good taste to print such original data from
a zinc plate made from typewritten copy,
which would mean that the only necessary expense
would be a typewriter with the necessary
twenty characters. Each university doing
research work and issuing original data
would need to be equipped with such a typewriter.
The desirable characteristics of the
characters, standing for 01 to 20 inclusive,
perhaps need to be worked out through cooperative
thinking before universal adoption.
The important problems in connection with 
coding qualitative data are the problems of 
(a) definitions of categories, and (b) the 
construction of standard precoded classifica- 
tion tables which we may refer to as standard 
codes. The two problems may be illustrated 
by the question, "Just how many marital sta- 
tuses are there; and just what is the best 
series of code numbers to be placed in front 
of the categories finally decided upon?" In 
the public mind marital status is as follows: 
(1) single, (2) married; while a little 
thought will add "widowed" and "divorced." 
That revises the code to (1) single, (2) mar- 
ried, (3) widowed, and (4) divorced. For many 
purposes, however, the "separated" must be 
taken into account. This results in a five- 
fold code: (1) single, (2) married, (3) wid- 
owed, (4) divorced, (5) separated. Then, for 
other purposes, "engaged" may be a definite 
status worthy of being taken into account. 
On a "quantitative" scale "engaged" would 
probably occur somewhere between "single" 





1. Professor C. C. Peters of Pennsylvania State College is experimenting with lithoprinting 52 8 1/2 x 11 thesis pages
on the two sides of an 8 1/2 x 11 sheet. This size, even without the stereoscopic reading glass with which he is
experimenting, would be fully satisfactory for the purpose at hand, thus at least quadrupling the capacity of a page as
here referred to.







and "married." The resulting scale would 
perhaps be (1) single, (2) engaged, (3) mar- 
ried, (4) widowed, (5) divorced, (6) sepa- 
rated. 

Which of the above standard codes is used 
for a specific purpose will depend upon that 
purpose. Consequently it follows that we
probably need a number of standardized codes 
to be known as Marital Code Number 1, Mari- 
tal Code Number 2, Marital Code Number 3, 
etc. If these were published in book form 
it would be possible for any investigator to 
refer by number to the standard code em- 
ployed, without being under any personal ob- 
ligation for duplicating in his research re- 
port the codes used. By common consent and 
usage the meanings of the code in time would 
be standardized. 
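A numbered set of standard codes of this kind maps naturally onto a small lookup table. In the sketch below the assignment of the numbers 1, 2 and 3 to the four-, five- and six-fold codes is purely an assumption for illustration, as are the names used; the categories themselves are the ones listed above.

```python
# Hypothetical numbering of the marital codes discussed in the text.
MARITAL_CODES = {
    1: ["single", "married", "widowed", "divorced"],
    2: ["single", "married", "widowed", "divorced", "separated"],
    3: ["single", "engaged", "married", "widowed", "divorced", "separated"],
}

def encode(status, code_number):
    """Return the punch value of a status under the named standard code."""
    return MARITAL_CODES[code_number].index(status) + 1

def decode(value, code_number):
    """Return the status that a punch value stands for."""
    return MARITAL_CODES[code_number][value - 1]

print(encode("married", 2), encode("married", 3))
print(decode(6, 3))
```

The same status can carry different punch values under different codes, which is why a report would need to cite the code number it used.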

As another case in question, "What are 
the religions?" The question here is very 
much more indefinite than in the case stated 
above. The religions are of no particular 
spiritual or historical interest to the stat- 
istician. Rather he is concerned with the 
frequency with which people give adherence to 
these religions. The phrase "give adherence
to" may be variously interpreted. It is the
policy of some people who refer to statistics
on religions to include under the term "Methodists,"
for example, all those who belong to
the Methodist religion and all those who
only "give adherence to" or "are allied with"
the Methodist religion; even if we make the
distinction between "member" and "attendant" we
are still under the difficulty of defining
"member" and "attendant." Is a member one
who has not lapsed his dues? Is he one who
has been baptized but whose dues have not
been kept up to date? We also note that baptism
is not a requirement for membership in a
number of churches. We could extend the list
of possible exceptions to the term "member"
indefinitely. The same is true for "attendant."
How many times must you attend in a
lifetime, or in the past year, or in the past
month, in order to be correctly categoried as
an "attendant" at, say, the Methodist church,
or to be allied with the Methodist church?
Is the best we can do here merely to say "a
member in good standing" and leave it to
the judgment of the individual interrogated
as to whether he is or is not a member in






good standing? For some purposes, this may
be satisfactory; for other purposes it decidedly
is not satisfactory, as, for example,
in the case of one who is concerned
with the extent to which the churches carry
"dead timber" on their church rolls. For the
general purposes of sociological research,
however, it seems safe to say that to interpret
"member" as a verbalism, as "a member
in good standing" or somewhat similar definition,
would render somewhat comparable the
results secured. Comparability of statistics,
rather than the exact truth, is the
end sought; for exact truth can hardly be
assumed to be a characteristic of categoried
qualitative terms of this type. Any categoried
term may be subject to various degrees
of searchingness as to definition. The
terms "male" and "female" would seem to be,
in popular opinion, two of the "most absolute"
of categories. Terman has shown very
nicely, however, that masculinity and femininity
vary on a scale continuum, with the
concrete result that we may talk about "masculine
males," "feminine males," "feminine
females" and "masculine females," or better
still, may designate them by degrees of some
arbitrary numerical scale in which the scores
will not seem to have a moral opprobrium attached
as where verbal terms are used.

From the viewpoint of securing comparable 
results in one of the most important fields 
of educational research, an urgent problem 
is that of determining what are the depart- 
ments (subjects) conducted in the entire 
scheme of schools represented by the kinder- 
garten, the elementary school, the secondary 
school, the college, and the university. Con- 
cretely, is "bird study" ornithology? Is it 
nature study? Is it zoology? Is it biology?
The terms quoted above have different
degrees of generality, specificness, and inclusiveness.
What shall be the criterion
for determining the names? Do the terms now 
used in different schools and colleges mean 
anything? Are there thirty subjects of 
study in the public schools or three thou- 
sand? Some agreement in answer to such ques- 
tions would enable research workers in edu- 
cation to secure somewhat more comparable 
statistics than we now possess. This need
for standardizing definitions in education 

















234 JOURNAL OF EXPERIMENTAL EDUCATION Volume I, No, 3 
has been very adequately presented by Phillips.¹

As examples of what may be the practical outcome of such attempts we
present in the appendix a few standard codes which we believe are
reasonably satisfactory:² (1) a code for alphabetizing names and giving
serial numbers to persons based on their names--for positive
identification of the individual and alphabetization by machine of the
resulting cards; (2) a sample page of an abbreviated (not
cross-referenced) occupational code in which the census occupations were
boiled down to an occupational grouping of 99 different occupations--so
constructed that two scorers will objectively give the same code number
to the same reported occupation--of-parents-as-reported-by-college-students;
(3) a code for recoding occupations, e.g., for determining the meaning of
a machine-produced frequency distribution; (4) an abbreviated sample page
of a religious code (showing the elaborate cross indexing necessary to
facilitate the use of the code).

With meaningful serial numbers available for every school child of the
state, follow-up programs necessitating positive identification of
individuals are readily possible. Public school directories could be
issued and exchanges of these between states would make it possible to
trace the educational careers of individual children from kindergarten to
university and to ultimate industrial employment. When such a child
became a teacher in the state school system, his success and failure in
kindergarten, high school and college could readily be allocated to him
through the existing records. Published directories of such serial
numbers would enable state surveys to be done at a very reasonable cost
and each teacher could be held responsible for copying coded information
regarding her job from the directory rather than for furnishing answers
necessitating a great deal of coding at a central point. In all these
matters, education, if it should respond to the challenge here outlined,
will but be following in the footsteps of a number of other agencies
which have standardized their practices to some extent: the International
Association of Police, who have made a serious effort to standardize
crime categories and to make possible comparable statistics from police
and criminal records; the Child Guidance Clinics sponsored by the
American Mental Hygiene Association, which have standardized their
record-keeping; delinquency statistics standardized by the National
Probation Association; the juvenile labor statistics standardized by the
Children's Bureau of the U. S. Department of Labor; and the statistics of
birth, death and marriage, standardized over a longer period of time by
the U. S. Public Health Service and cooperating vital statisticians,
state health bureaus, American Medical Association and other agencies,
including the life insurance companies which are financially interested
in the standardization of such statistics.

It seems to the author that the topic is worthy of special consideration
by a committee of some national educational organization which would
spend considerable time on the problems above enumerated, together with
those dependent problems (e.g., the standardization of statistical
terminology) which would be sure to grow out of any serious consideration
of the problems mentioned.














1. F. M. Phillips, "Uniformity in Defining, Reporting and Recording Statistical Items," Quarterly Journal of the
American Statistical Association, XXVI (March, 1931), pp. 181-186.
2. The complete cross-referenced occupations and religious codes will appear in the author's forthcoming:
"Questionnaires, Standard Codes and Hollerith Machines."











March, 1933 H. A. Toops


APPENDIX A


CODE FOR PARTIAL ALPHABETIZATION OF LAST NAMES
Code No. 1
November 4, 1930





[Chart: three-digit code numbers for letter pairs. Each pair consists of a
first letter followed either by a blank or by a second letter, and names
beginning "Mc" have a separate block of numbers. Entries that remain
legible include: Ii 250, Jo 286, Lt 351, McI 370, Mx 415, Oz 477, Pi 490,
Ts 650, Us 680, Vs 740, Ws 770.]
Use of Code.--To code the name Jones, T. A., spell it as follows:

    Name    Jo     ne     s-     TA
    Code    286    426    601    632

By means of a 12-digited code number, cards of all names containing not over six letters in the last name and not over two initials can be
perfectly alphabetized by machine sorting at a speed of some 2,000 cards completely alphabetized per hour.
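
The coding rule for the example Jones, T. A. can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lookup table below holds just the four pairs of the worked example, the notation "S-" for a letter followed by a blank and its pairing with code 601 are inferred from the example, and the function name `name_code` is hypothetical.

```python
# Illustrative fragment of the pair-to-code chart: only the four pairs of
# the worked example. "S-" (assumed notation) means "S" followed by a blank.
PAIR_CODES = {"JO": 286, "NE": 426, "S-": 601, "TA": 632}


def name_code(last_name, initials, table=PAIR_CODES):
    """Return the 12-digit sorting code for a last name of at most six
    letters plus at most two initials (three digits per letter pair)."""
    if len(last_name) > 6 or len(initials) > 2:
        raise ValueError("the 12-digit code covers six letters and two initials only")
    padded = last_name.upper().ljust(6, "-")          # pad short names with blanks
    pairs = [padded[i:i + 2] for i in range(0, 6, 2)]  # three pairs from the last name
    pairs.append(initials.upper().ljust(2, "-"))       # the initials form the fourth pair
    return "".join(f"{table[pair]:03d}" for pair in pairs)


print(name_code("Jones", "TA"))  # prints 286426601632
```

Because every pair maps to a number that preserves alphabetical order, sorting cards on the 12 digits sorts the names. A pair missing from the fragment raises `KeyError`; a complete table would have to supply every pair of the chart.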




















JOURNAL OF EXPERIMENTAL EDUCATION 


APPENDIX B 


Abbreviated Code for Coding Father's Occupation 
Code No. 6 


00 No Data
82 Accountant (Auditor)
57 Actor
75 Amusements
58 Architect
52 Army (Navy)
59 Artist (Sculptor, Teacher of Art)
60 Author (Editor, Journalist, Reporter)
39 Automobile Sales (Salesman)
03 Baker
40 Banker
76 Barber (Beauty Parlor)
04 Blacksmith
83 Bookkeeper (Cashier)
05 Bricklayer
41 Butcher
42 Business or Trade (Owner of Business)
06 Carpenter
26 Chauffeur
61 Chemist
62 Civil Engineer (Construction Engineer)
53 City, County, or State Official
63 Clergyman
85 Clerk-Salesman (Employees)
07 Contractor
77 Cook
86 Deliveryman-Truckman (Goods not Persons)
64 Dentist
65 Designer (Inventor)
08 Dressmaker
43 Druggist (Pharmacist)
09 Electrician
66 Electrical Engineer
10 Engineer (Stationary Engineer)
11 Factory Hand (Machine Users)
01 Farmer
54 Fireman (City)
12 Fireman, Stationary
87 Foreman-Inspector
27 Garage (Oil Stations)
88 Grain Dealer (Miller)
44 Grocer
55 Guard (Protection of Property)
45 Hardware
78 Housekeeper (Lodgings)
89 Housewife (Mother)
46 Insurance
13 Ironworker
79 Janitor
14 Jeweler
47 Junk
90 Laborer (Handles Tools)
80 Launderer
67 Lawyer
48 Lumber Dealer
15 Machinist
28 Mail Carrier
91 Manager-Executive
16 Manufacturer
68 Mechanical Engineer (Mining Engineer)
17 Milliner
02 Miner (All Mining)
18 Molder
69 Musician
92 Nurse
19 Painter (Interior Decorator)
70 Photographer
71 Physician
20 Plasterer
21 Plumber
56 Policeman (Detectors & Preventors of Crimes)
22 Printer
29 Railroad Agent
30 Railroad Brakeman
31 Railroad Conductor
32 Railroad Engineer
33 Railroad Fireman
34 Railroad Mail Clerk
35 Railroad Unclassified (Railroader)
49 Real Estate
93 Restaurant (Soft Drinks, Hunger & Thirst)
72 Schools-precollege
73 Schools-college
81 Servant (Menial Capacity)
23 Sheet Metal
24 Shoemaker (Makers and Repairers)
84 Stenographer
36 Street Car Conductor
25 Tailor
37 Telegraph Operator
38 Telephone Operator
50 Traveling Salesman
51 Undertaker
74 Veterinary
94-99 Unclassifiable
94 Agriculture
95 Manufacturing
96 Transportation
97 Public Service
98 Professional Service
99 Domestic and Personal Service
00 No Data











March, 1933 H. A. Toops 





APPENDIX C 






Code for Decoding Father's Occupation 
Code No. 6 (a) 




















































00 No data 51 Undertaker

01 Farmer 52 Army (Navy)

02 Miner (All Mining) 53 City, County and State Officials 

03 Baker 54 Fireman (City) 

04 Blacksmith 55 Guard (Protection of Property) 

05 Bricklayer 56 Policeman (Detector & Preventor of Crime) 
06 Carpenter 57 Actor 

07 Contractor 58 Architect 

08 Dressmaker 59 Artist (Sculptor, Teacher of Art) 

09 Electrician 60 Author (Editor, Journalist, Reporter) 

10 Engineer (Stationary Engineer) 61 Chemist 

11 Factory Hand (Machine Man) 62 Civil Engineer (Construction Engineer)

12 Fireman (Stationary) 63 Clergyman
13 Ironworker 64 Dentist 

14 Jeweler 65 Designer (Inventor) 
15 Machinist 66 Electrical Engineer 

16 Manufacturer 67 Lawyer 

17 Milliner 68 Mechanical Engineer (Mining Engineer) 

18 Molder 69 Musician 

19 Painter (Interior Decorator) 70 Photographer 

20 Plasterer 71 Physician 

21 Plumber 72 Schools-precollege

22 Printer 73 Schools-college 

23 Sheet Metal 74 Veterinary 

24 Shoemaker (Makers and Repairers) 75 Amusements 

25 Tailor 76 Barber (Beauty Parlor) 

26 Chauffeur 77 Cook 

27 Garage (Oil Station) 78 Housekeeper (Lodgings) 

28 Mail Carrier 79 Janitor 

29 Railroad Agent 80 Launderer 

30 Railroad Brakeman 81 Servant (Menial Capacity) 

31 Railroad Conductor 82 Accountant (Auditor) 

32 Railroad Engineer 83 Bookkeeper (Cashier) 

33 Railroad Fireman 84 Stenographer
34 Railroad Mail Clerk 85 Clerk-Salesman (Employee) 

35 Railroad Unclassified 86 Delivery-Truckman (Goods not Persons) 

36 Street Car Conductor 87 Foreman-Inspector 

37 Telegraph Operator 88 Grain Dealer-Miller 

38 Telephone Operator 89 Housewife (Mother) 

39 Automobile Sales (Salesman) 90 Laborer (Handles Tools) 

40 Banker 91 Manager-Executive 

41 Butcher (Meat Dealer) 92 Nurse 

42 Business or Trade (Owner of Business) 93 Restaurant (Hunger & Thirst, not Amusement) 
43 Druggist (Pharmacist) 94-99 Unclassifiable 

44 Grocer 94 Agriculture 

45 Hardware 95 Manufacturing 

46 Insurance 96 Transportation 

47 Junk 97 Public Service 

48 Lumber Dealer 98 Professional Service

49 Real Estate 99 Domestic and Personal Service

50 Traveling Salesman 00 No Data
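
How a decoding table of this kind gives meaning to a machine-produced frequency distribution can be sketched as follows. The sketch assumes a Python dictionary holding only a few entries copied from Code No. 6 (a); the sample card codes are invented for illustration, and the function name `decode_distribution` is hypothetical.

```python
from collections import Counter

# A few entries copied from Code No. 6 (a); the full code runs from 00 to 99.
OCCUPATION = {
    "00": "No data",
    "01": "Farmer",
    "06": "Carpenter",
    "63": "Clergyman",
    "71": "Physician",
}


def decode_distribution(punched_codes, table=OCCUPATION):
    """Count the punched codes and attach the occupation name to each count."""
    counts = Counter(punched_codes)
    return [(code, table.get(code, "not in this fragment"), n)
            for code, n in sorted(counts.items())]


sample = ["01", "71", "01", "63", "01", "00", "71"]  # invented card codes
for code, label, n in decode_distribution(sample):
    print(code, label, n)
```

Sorting on the code number mirrors the order in which a sorting machine would deliver the cards, which is why the decoding table is arranged numerically rather than alphabetically.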











APPENDIX D 


Religious Code 
(First Page of an 8 Page Code) 





98 Local Church name given--not the creed; for example "Myrtle Tree" (Federated)
97 Church Member--no denomination specified (Protestant) (I am)
96 Community (Union)
01 Advent Bodies
a. Advent Christian Church
b. Church of God (Adventist)
c. Church of God in Christ Jesus
d. Life and Advent Union
e. Seventh Day Adventist Union
08 Baptist Bodies
a. Colored Free Will Baptists
b. Colored Primitive Baptists
c. Duck River and Kindred Association of Baptists
d. Free Baptists
e. Free Will Baptists
f. General Baptists
g. National Baptist Convention
h. Negro Baptists
i. Northern Baptist Convention
j. Primitive Baptists
k. Regular Baptists
l. Scandinavian Independent Baptists
m. Separate Baptists
n. Seventh Day Baptists
o. Six Principle Baptists
p. Southern Baptists Convention
q. Two-Seed-in-the-Spirit (Predestinarian Baptist)
r. United Baptists
The Brethren Church, see German Baptist Brethren
Brethren in Christ
Catholic Church, Roman
Christian Science, see Church of Christ, Scientist (New Thought)
Christian Union
Church of the Brethren, see German Baptist Brethren
Church of God
Churches of Christ (Christian Church)
Church of the Nazarene
Congregational Churches
32 Disciples of Christ (Campbellite)
44 Dunker (Dunkard, Tunker) See German Baptist Brethren
34 Eastern Orthodox Churches (Greek Orthodox, Greek Catholic)
a. Albanian
b. Bulgarian
c. Greek
d. Rumanian
e. Russian
f. Serbian
g. Syrian
35 Episcopal Church, Protestant
37 Evangelical Church, General Conference (Calvary)
38 Evangelical Congregational Church (Congregational Church)








JOURNAL OF EXPERIMENTAL EDUCATION Volume I, No. 3 








March, 1933 






SIMPLIFIED SCHEMAS FOR MULTIPLE LINEAR CORRELATION 
by 
Harold D. Griffin 
Nebraska State Teachers College 
Wayne, Nebraska 










It is the purpose of this article to present a succession of formulas
solving three- to eight-variable multiple correlation problems with an
economy of time and effort unequaled by current methods.

The writer has shown elsewhere¹ that it is possible to develop formulas
based upon the Doolittle method for solving simultaneous linear equations so
as to reduce three- and four-variable multiple correlation problems to
straightforward substitution of the original zero-order correlation
coefficients, thus making it exceedingly simple to obtain regression
coefficients without any preliminary study of the processes involved. The
formulas in the article to which we refer are as follows:












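Three Variables

\[
\beta_{13.2} = \frac{r_{13} - r_{12} r_{23}}{1 - r_{23}^2} \tag{1a}
\]

\[
\beta_{12.3} = r_{12} - \beta_{13.2} r_{23} \tag{1b}
\]

Four Variables

\[
\beta_{14.23} = \frac{r_{14} - r_{12} r_{24} - \dfrac{r_{13} - r_{12} r_{23}}{1 - r_{23}^2}\,(r_{34} - r_{24} r_{23})}
{1 - r_{24}^2 - \dfrac{r_{34} - r_{24} r_{23}}{1 - r_{23}^2}\,(r_{34} - r_{24} r_{23})} \tag{2a}
\]

\[
\beta_{13.24} = \frac{r_{13} - r_{12} r_{23}}{1 - r_{23}^2} - \beta_{14.23}\,\frac{r_{34} - r_{24} r_{23}}{1 - r_{23}^2} \tag{2b}
\]

\[
\beta_{12.34} = r_{12} - (\beta_{13.24} r_{23} + \beta_{14.23} r_{24}) \tag{2c}
\]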


1. Harold D. Griffin, Annals of Mathematical Statistics II (May, 1931), pp. 150-53. 




















JOURNAL OF EXPERIMENTAL EDUCATION Volume I, No, 3 


The three-variable formulas need no further simplification for clarity,
but in a four-variable problem it is readily apparent that if symbols are
substituted for certain expressions in the $\beta_{14.23}$ equation repeated in the
ensuing equation, $\beta_{13.24}$, the tendency to duplicate previous calculations
will be eliminated. Thus:

\[
\beta_{14.23} = \frac{\alpha_{14.2} - \beta_{13.2}\,\alpha_{43.2}}{\omega_{24} - \beta_{43.2}\,\alpha_{43.2}} \tag{3a}
\]

\[
\beta_{13.24} = \beta_{13.2} - \beta_{14.23}\,\beta_{43.2} \tag{3b}
\]

\[
\beta_{12.34} = r_{12} - (\beta_{13.24} r_{23} + \beta_{14.23} r_{24}) \tag{3c}
\]
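
The four-variable substitution can be sketched in Python as a check on the arithmetic. This is a minimal sketch, not the author's worksheet: the function name `four_variable_betas` and the six correlations in the usage lines are invented for illustration, and α and ω are taken as the numerator and denominator of the first-order β's as described in the text.

```python
def four_variable_betas(r12, r13, r14, r23, r24, r34):
    """Partial-regression coefficients for a four-variable problem,
    following the alpha/omega substitution of formulas [3a]-[3c]."""
    # First-order quantities: numerators (alpha) and denominators (omega).
    a13_2 = r13 - r12 * r23
    a14_2 = r14 - r12 * r24
    a43_2 = r34 - r24 * r23
    w23 = 1 - r23 ** 2
    w24 = 1 - r24 ** 2
    b13_2 = a13_2 / w23
    b43_2 = a43_2 / w23
    # Final coefficients, obtained in the order [3a], [3b], [3c].
    b14_23 = (a14_2 - b13_2 * a43_2) / (w24 - b43_2 * a43_2)
    b13_24 = b13_2 - b14_23 * b43_2
    b12_34 = r12 - (b13_24 * r23 + b14_23 * r24)
    return b12_34, b13_24, b14_23


# Invented zero-order correlations, used only to exercise the formulas.
r12, r13, r14, r23, r24, r34 = 0.5, 0.4, 0.3, 0.3, 0.2, 0.1
b2, b3, b4 = four_variable_betas(r12, r13, r14, r23, r24, r34)
# The coefficients should satisfy the normal equations in r13 and r14.
print(abs(b2 * r23 + b3 + b4 * r34 - r13) < 1e-12)
print(abs(b2 * r24 + b3 * r34 + b4 - r14) < 1e-12)
```

Each intermediate quantity is computed once and reused, which is the saving the α and ω symbols are meant to secure.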


This substitution involves two new concepts, α (Greek, alpha) for the
numerator and ω (Greek, omega) for the denominator of the partial
regression coefficient, β (Greek, beta).²

The introduction of these two new symbols enables us to write endothetic
formulas³ for the calculation of partial-regression coefficients for
any number of variables. These endothetic formulas are simpler to
understand and follow than the Doolittle method which inspired their development,
for they involve only substitution of zero-order correlation coefficients
and their derivatives in formulas of a few simple types which remain
constant for any number of variables.

The Doolittle method for solving simultaneous linear equations mentioned
in the preceding paragraph, although long used in the United States Coast
and Geodetic Survey and by some engineers, was first introduced to
statisticians by Tolley and Ezekiel in 1923.⁵ The superiority of the Doolittle





1. Statistical purists may express criticism of the order of the subscripts in $\alpha_{43.2}$ and
elsewhere. It is true that a complete expression of $\beta_{14.23}$ in terms of $\beta$'s,

\[
\beta_{14.23} = \frac{\beta_{14.2} - \beta_{13.2}\,\beta_{34.2}}{1 - \beta_{43.2}\,\beta_{34.2}}
\]

would indicate that $\alpha_{34.2}$ is the more exact expression. Inspection of the ensuing
schemas will show, however, that writing the subscripts in the order $\alpha_{43.2}$, so as to
agree with $\beta_{43.2}$, will avoid confusion in using the formulas.
2. Used in its zero-order, $\omega$ is merely Kelley's concept, $k^2 = 1 - r^2$. In successively
higher orders $\omega$ becomes increasingly more intricate.
3. The writer has coined the term "endothetic formulas" in order to express their compact,
telescoping qualities.
H. R. Tolley and M. M. B. Ezekiel, "A Method of Handling Multiple Correlation Problems,"
Quarterly Publications of the American Statistical Association, XVIII (December, 1923),
pp. 995-1005.
Excellent presentations of the Doolittle method are to be found in the following:
H. A. Wallace and G. W. Snedecor, Correlation and Machine Calculation (Revised edition.
Ames, Iowa: Iowa State College Official Publication).
C. C. Peters and E. C. Wykes, "Simplified Methods for Computing Regression Coefficients
and Partial and Multiple Correlations," Journal of Educational Research, XXIII (May,
1931), pp. 383-395.















H. D. Griffin 





method over solution by determinants¹ or by iteration methods² has been
indicated elsewhere by the writer.³ Where partial correlation coefficients
are especially needed, the conventional but tedious partial correlation
procedure for obtaining the regression equation is generally followed.⁴ It is,
however, possible to obtain certain partial correlation coefficients by
proper manipulation of the schemas given in this article. Thus, in a
three-variable problem, the following two partial correlation coefficients are
easily obtained by endothetic formulas:

\[
r_{12.3} = \sqrt{1 - (1 - R_{1.23}^2)/(1 - r_{13}^2)}
\]

\[
r_{13.2} = \sqrt{1 - (1 - R_{1.23}^2)/(1 - r_{12}^2)}
\]

And in a four-variable problem,

\[
r_{14.23} = \sqrt{1 - (1 - R_{1.234}^2)/(1 - R_{1.23}^2)}
\]

\[
r_{13.24} = \sqrt{1 - (1 - R_{1.234}^2)/(1 - R_{1.24}^2)}
\]

\[
r_{12.34} = \sqrt{1 - (1 - R_{1.234}^2)/(1 - R_{1.34}^2)}
\]

A little study by the reader will indicate the necessary modifications to
be made in the endothetic formulas in order to obtain these and other
partial correlation coefficients.⁵
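
A short Python sketch can confirm that the three-variable expressions agree with the conventional partial-correlation formula. The correlations used are invented for illustration and the function names are hypothetical. The square root yields only the magnitude of the coefficient, so the comparison below is made on absolute values.

```python
import math


def multiple_r_squared(r12, r13, r23):
    """R^2 of variable 1 on variables 2 and 3, from the three-variable betas."""
    beta13_2 = (r13 - r12 * r23) / (1 - r23 ** 2)
    beta12_3 = r12 - beta13_2 * r23
    return beta12_3 * r12 + beta13_2 * r13


def partials_from_multiple_r(r12, r13, r23):
    """r12.3 and r13.2 (magnitudes) obtained from R^2 as in the text."""
    R2 = multiple_r_squared(r12, r13, r23)
    r12_3 = math.sqrt(1 - (1 - R2) / (1 - r13 ** 2))
    r13_2 = math.sqrt(1 - (1 - R2) / (1 - r12 ** 2))
    return r12_3, r13_2


def conventional_partial(rxy, rxz, ryz):
    """Conventional first-order partial correlation of x and y, holding z."""
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Invented correlations for illustration.
r12, r13, r23 = 0.5, 0.4, 0.3
r12_3, r13_2 = partials_from_multiple_r(r12, r13, r23)
print(abs(r12_3 - abs(conventional_partial(r12, r13, r23))) < 1e-12)
print(abs(r13_2 - abs(conventional_partial(r13, r12, r23))) < 1e-12)
```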

































1. A systematic presentation of three-, four-, and five-variable multiple correlation
problems solved by determinants is given by C. C. Peters and E. C. Wykes in the Journal of
Educational Research XXIV (June, 1931), pp. 44-52.

2. For example of an iteration solution, see F. S. Salisbury, "A Simplified Method of
Computing Multiple Correlation Constants," Journal of Educational Psychology XX (1929), pp.
44-52.

3. H. D. Griffin, "On Partial Correlation vs. Partial Regression for Obtaining the Multiple
Regression Equation," Journal of Educational Psychology XXII (January, 1931), pp. 35-44.
H. D. Griffin, "Fundamental Formulas for the Doolittle Method Using Zero-Order
Correlation Coefficients," Annals of Mathematical Statistics II (May, 1931), pp. 150-153.

4. For partial correlation methods involving three, four, and five variables, the writer
recommends
H. E. Garrett, Statistics in Psychology and Education (New York: Longmans, Green and
Company, 1926), pp. 225-251, 240-251.
For six variables consult
C. L. Huffaker, "A Contribution to the Technique of Partial Correlation," Journal of
Applied Psychology VII (1923), pp. 155-142.
For seven- and eight-variable problems, see
J. E. Bathurst, "A Partial Correlation Schema," Journal of Applied Psychology XI (1927),
pp. 155-164.

5. For the derivation of these, perhaps, unfamiliar formulas for partial correlation
coefficients, the reader is referred to
M. M. B. Ezekiel, Methods of Correlation Analysis (New York: John Wiley & Sons, 1930),
pp. 179-181, 378-379. There is also a new concept, the coefficient of part correlation,

































































JOURNAL OF EXPERIMENTAL EDUCATION Volume I, No, 3 


The writer is confident that any statistician who will give these endo- 
thetic formulas a test will never return to partial-correlation, determinant, 
or iteration methods in solving problems within the scope of these schemas.
Even after their involved techniques have been mastered, iteration methods 
take from three to five times as long, and determinant and partial-correla- 
tion methods more than ten times as long, to solve a seven-variable problem 
with even approximate accuracy. One reason for this is that the endothetic 
formulas will obtain final two-place accuracy using only three-place figures, 
whereas the other methods require from four- to six-place figures. 

In each of the following schemas, a check is furnished on the accuracy
of the partial-regression coefficients.


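THREE-VARIABLE SCHEMA


Required Zero-Order Data

\[
M_1, M_2, M_3. \qquad \sigma_1, \sigma_2, \sigma_3. \qquad r_{12}, r_{13}, r_{23}.
\]

Final Partial-Regression Coefficients

\[
\begin{aligned}
\beta_{13.2} &= \frac{r_{13} - r_{12} r_{23}}{1 - r_{23}^2} \\
\beta_{12.3} &= r_{12} - \beta_{13.2} r_{23}
\end{aligned}
\]

\[
[\text{Check: } r_{13} = \beta_{12.3} r_{23} + \beta_{13.2}]
\]

Multiple Correlation Coefficient

\[
R_{1.23} = \sqrt{\beta_{12.3} r_{12} + \beta_{13.2} r_{13}}
\]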








(Footnote continued)

\[
{}_{12}r_{34} = \frac{1}{\sqrt{1 + \dfrac{1 - R_{1.234}^2}{\beta_{12.34}^2}}}
\]

easily found by endothetic formulas, and which possesses sufficient value in analyzing
variables to be worthy of your consideration. For a usable presentation of the
coefficient of part correlation, consult the following article and its references:
H. D. Griffin, "On the Coefficient of Part Correlation," Journal of the American
Statistical Association XXVII (September, 1932), pp. 298-301.










H. D. Griffin 








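Regression Equation

\[
\begin{aligned}
b_{12.3} &= \beta_{12.3}\,\sigma_1/\sigma_2 \\
b_{13.2} &= \beta_{13.2}\,\sigma_1/\sigma_3 \\
K &= M_1 - (M_2 b_{12.3} + M_3 b_{13.2}) \\
X_1 &= b_{12.3} X_2 + b_{13.2} X_3 + K
\end{aligned}
\]

Standard Error of Estimate

\[
\sigma_{1.23} = \sigma_1 \sqrt{1 - R_{1.23}^2}
\]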
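
The whole three-variable schema, from zero-order data to the standard error of estimate, can be followed in a short Python sketch. The means, standard deviations, and correlations in the usage lines are invented for illustration, and the function name `three_variable_schema` is hypothetical.

```python
import math


def three_variable_schema(means, sigmas, r12, r13, r23):
    """Work through the three-variable schema and return its results."""
    M1, M2, M3 = means
    s1, s2, s3 = sigmas
    # Final partial-regression coefficients and the check on r13.
    beta13_2 = (r13 - r12 * r23) / (1 - r23 ** 2)
    beta12_3 = r12 - beta13_2 * r23
    check_r13 = beta12_3 * r23 + beta13_2
    # Multiple correlation coefficient.
    R = math.sqrt(beta12_3 * r12 + beta13_2 * r13)
    # Regression equation in raw-score form.
    b12_3 = beta12_3 * s1 / s2
    b13_2 = beta13_2 * s1 / s3
    K = M1 - (M2 * b12_3 + M3 * b13_2)
    # Standard error of estimate.
    se = s1 * math.sqrt(1 - R ** 2)
    return {"beta12_3": beta12_3, "beta13_2": beta13_2, "check_r13": check_r13,
            "R": R, "b12_3": b12_3, "b13_2": b13_2, "K": K, "se": se}


# Invented data for illustration.
result = three_variable_schema((50.0, 40.0, 30.0), (10.0, 8.0, 6.0), 0.5, 0.4, 0.3)
print(round(result["check_r13"], 6))  # reproduces r13 when the betas are correct
print(round(result["R"], 4), round(result["se"], 4))
```

A useful property to verify by hand is that the regression equation returns M1 when the means M2 and M3 are substituted for X2 and X3.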











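FOUR-VARIABLE SCHEMA


Required Zero-Order Data

\[
M_1, M_2, M_3, M_4. \qquad \sigma_1, \sigma_2, \sigma_3, \sigma_4.
\]

\[
r_{12}, r_{13}, r_{14}, r_{23}, r_{24}, r_{34}.
\]

First-Order Calculations

\[
\begin{aligned}
\alpha_{13.2} &= r_{13} - r_{12} r_{23} \\
\alpha_{14.2} &= r_{14} - r_{12} r_{24} \\
\alpha_{43.2} &= r_{34} - r_{24} r_{23} \\
\omega_{23} &= 1 - r_{23}^2 \\
\omega_{24} &= 1 - r_{24}^2 \\
\beta_{13.2} &= \alpha_{13.2} / \omega_{23} \\
\beta_{43.2} &= \alpha_{43.2} / \omega_{23}
\end{aligned}
\]

Final Partial-Regression Coefficients

\[
\begin{aligned}
\beta_{14.23} &= \frac{\alpha_{14.2} - \beta_{13.2}\,\alpha_{43.2}}{\omega_{24} - \beta_{43.2}\,\alpha_{43.2}} \\
\beta_{13.24} &= \beta_{13.2} - \beta_{14.23}\,\beta_{43.2} \\
\beta_{12.34} &= r_{12} - (\beta_{13.24} r_{23} + \beta_{14.23} r_{24})
\end{aligned}
\]

\[
[\text{Check: } r_{13} = \beta_{12.34} r_{23} + \beta_{13.24} + \beta_{14.23} r_{34}]
\]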













JOURNAL OF EXPERIMENTAL EDUCATION Volume I, No, 3 


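Multiple Correlation Coefficient

\[
R_{1.234} = \sqrt{\beta_{12.34} r_{12} + \beta_{13.24} r_{13} + \beta_{14.23} r_{14}}
\]

Regression Equation

\[
\begin{aligned}
b_{12.34} &= \beta_{12.34}\,\sigma_1/\sigma_2 \\
b_{13.24} &= \beta_{13.24}\,\sigma_1/\sigma_3 \\
b_{14.23} &= \beta_{14.23}\,\sigma_1/\sigma_4 \\
K &= M_1 - (M_2 b_{12.34} + M_3 b_{13.24} + M_4 b_{14.23}) \\
X_1 &= b_{12.34} X_2 + b_{13.24} X_3 + b_{14.23} X_4 + K
\end{aligned}
\]

Standard Error of Estimate

\[
\sigma_{1.234} = \sigma_1 \sqrt{1 - R_{1.234}^2}
\]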


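FIVE-VARIABLE SCHEMA


Required Zero-Order Data

\[
M_1, M_2, M_3, M_4, M_5. \qquad \sigma_1, \sigma_2, \sigma_3, \sigma_4, \sigma_5.
\]

\[
r_{12}, r_{13}, r_{14}, r_{15}, r_{23}, r_{24}, r_{25}, r_{34}, r_{35}, r_{45}.
\]

First-Order Calculations

\[
\begin{aligned}
\alpha_{13.2} &= r_{13} - r_{12} r_{23} \\
\alpha_{14.2} &= r_{14} - r_{12} r_{24} \\
\alpha_{15.2} &= r_{15} - r_{12} r_{25} \\
\alpha_{43.2} &= r_{34} - r_{24} r_{23} \\
\alpha_{53.2} &= r_{35} - r_{25} r_{23} \\
\alpha_{54.2} &= r_{45} - r_{25} r_{24} \\
\omega_{23} &= 1 - r_{23}^2 \\
\omega_{24} &= 1 - r_{24}^2 \\
\omega_{25} &= 1 - r_{25}^2 \\
\beta_{13.2} &= \alpha_{13.2} / \omega_{23} \\
\beta_{43.2} &= \alpha_{43.2} / \omega_{23} \\
\beta_{53.2} &= \alpha_{53.2} / \omega_{23}
\end{aligned}
\]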












H. D. Griffin 


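Second-Order Calculations

\[
\begin{aligned}
\alpha_{14.23} &= \alpha_{14.2} - \beta_{13.2}\,\alpha_{43.2} \\
\alpha_{15.23} &= \alpha_{15.2} - \beta_{13.2}\,\alpha_{53.2} \\
\alpha_{54.23} &= \alpha_{54.2} - \beta_{53.2}\,\alpha_{43.2} \\
\omega_{24.3} &= \omega_{24} - \beta_{43.2}\,\alpha_{43.2} \\
\omega_{25.3} &= \omega_{25} - \beta_{53.2}\,\alpha_{53.2} \\
\beta_{14.23} &= \alpha_{14.23} / \omega_{24.3} \\
\beta_{54.23} &= \alpha_{54.23} / \omega_{24.3}
\end{aligned}
\]

Final Partial-Regression Coefficients

\[
\begin{aligned}
\beta_{15.234} &= \frac{\alpha_{15.23} - \beta_{14.23}\,\alpha_{54.23}}{\omega_{25.3} - \beta_{54.23}\,\alpha_{54.23}} \\
\beta_{14.235} &= \beta_{14.23} - \beta_{15.234}\,\beta_{54.23} \\
\beta_{13.245} &= \beta_{13.2} - (\beta_{14.235}\,\beta_{43.2} + \beta_{15.234}\,\beta_{53.2}) \\
\beta_{12.345} &= r_{12} - (\beta_{13.245} r_{23} + \beta_{14.235} r_{24} + \beta_{15.234} r_{25})
\end{aligned}
\]

\[
[\text{Check: } r_{13} = \beta_{12.345} r_{23} + \beta_{13.245} + \beta_{14.235} r_{34} + \beta_{15.234} r_{35}]
\]

Multiple Correlation Coefficient

\[
R_{1.2345} = \sqrt{\beta_{12.345} r_{12} + \beta_{13.245} r_{13} + \beta_{14.235} r_{14} + \beta_{15.234} r_{15}}
\]

Regression Equation

\[
\begin{aligned}
b_{12.345} &= \beta_{12.345}\,\sigma_1/\sigma_2 \\
b_{13.245} &= \beta_{13.245}\,\sigma_1/\sigma_3 \\
b_{14.235} &= \beta_{14.235}\,\sigma_1/\sigma_4 \\
b_{15.234} &= \beta_{15.234}\,\sigma_1/\sigma_5 \\
K &= M_1 - (M_2 b_{12.345} + M_3 b_{13.245} + M_4 b_{14.235} + M_5 b_{15.234}) \\
X_1 &= b_{12.345} X_2 + b_{13.245} X_3 + b_{14.235} X_4 + b_{15.234} X_5 + K
\end{aligned}
\]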














JOURNAL OF EXPERIMENTAL EDUCATION Volume I, No, 3 






Standard Error of Estimate

\[
\sigma_{1.2345} = \sigma_1 \sqrt{1 - R_{1.2345}^2}
\]







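SIX-VARIABLE SCHEMA


Required Zero-Order Data

\[
M_1, M_2, M_3, M_4, M_5, M_6. \qquad \sigma_1, \sigma_2, \sigma_3, \sigma_4, \sigma_5, \sigma_6.
\]

\[
r_{12}, r_{13}, r_{14}, r_{15}, r_{16}, r_{23}, r_{24}, r_{25}, r_{26}, r_{34}, r_{35}, r_{36}, r_{45}, r_{46}, r_{56}.
\]

First-Order Calculations

\[
\begin{aligned}
\alpha_{13.2} &= r_{13} - r_{12} r_{23} \\
\alpha_{14.2} &= r_{14} - r_{12} r_{24} \\
\alpha_{15.2} &= r_{15} - r_{12} r_{25} \\
\alpha_{16.2} &= r_{16} - r_{12} r_{26} \\
\alpha_{43.2} &= r_{34} - r_{24} r_{23} \\
\alpha_{53.2} &= r_{35} - r_{25} r_{23} \\
\alpha_{54.2} &= r_{45} - r_{25} r_{24} \\
\alpha_{63.2} &= r_{36} - r_{26} r_{23} \\
\alpha_{64.2} &= r_{46} - r_{26} r_{24} \\
\alpha_{65.2} &= r_{56} - r_{26} r_{25} \\
\omega_{23} &= 1 - r_{23}^2 \\
\omega_{24} &= 1 - r_{24}^2 \\
\omega_{25} &= 1 - r_{25}^2 \\
\omega_{26} &= 1 - r_{26}^2 \\
\beta_{13.2} &= \alpha_{13.2} / \omega_{23} \\
\beta_{43.2} &= \alpha_{43.2} / \omega_{23} \\
\beta_{53.2} &= \alpha_{53.2} / \omega_{23} \\
\beta_{63.2} &= \alpha_{63.2} / \omega_{23}
\end{aligned}
\]

Second-Order Calculations

\[
\begin{aligned}
\alpha_{14.23} &= \alpha_{14.2} - \beta_{13.2}\,\alpha_{43.2} \\
\alpha_{15.23} &= \alpha_{15.2} - \beta_{13.2}\,\alpha_{53.2} \\
\alpha_{16.23} &= \alpha_{16.2} - \beta_{13.2}\,\alpha_{63.2} \\
\alpha_{54.23} &= \alpha_{54.2} - \beta_{53.2}\,\alpha_{43.2} \\
\alpha_{64.23} &= \alpha_{64.2} - \beta_{63.2}\,\alpha_{43.2} \\
\alpha_{65.23} &= \alpha_{65.2} - \beta_{63.2}\,\alpha_{53.2} \\
\omega_{24.3} &= \omega_{24} - \beta_{43.2}\,\alpha_{43.2} \\
\omega_{25.3} &= \omega_{25} - \beta_{53.2}\,\alpha_{53.2} \\
\omega_{26.3} &= \omega_{26} - \beta_{63.2}\,\alpha_{63.2} \\
\beta_{14.23} &= \alpha_{14.23} / \omega_{24.3} \\
\beta_{54.23} &= \alpha_{54.23} / \omega_{24.3} \\
\beta_{64.23} &= \alpha_{64.23} / \omega_{24.3}
\end{aligned}
\]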












March, 1933 H. D. Griffin 247 







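Third-Order Calculations

\[
\begin{aligned}
\alpha_{15.234} &= \alpha_{15.23} - \beta_{14.23}\,\alpha_{54.23} \\
\alpha_{16.234} &= \alpha_{16.23} - \beta_{14.23}\,\alpha_{64.23} \\
\alpha_{65.234} &= \alpha_{65.23} - \beta_{64.23}\,\alpha_{54.23} \\
\omega_{25.34} &= \omega_{25.3} - \beta_{54.23}\,\alpha_{54.23} \\
\omega_{26.34} &= \omega_{26.3} - \beta_{64.23}\,\alpha_{64.23} \\
\beta_{15.234} &= \alpha_{15.234} / \omega_{25.34} \\
\beta_{65.234} &= \alpha_{65.234} / \omega_{25.34}
\end{aligned}
\]

Final Partial-Regression Coefficients

\[
\begin{aligned}
\beta_{16} &= \frac{\alpha_{16.234} - \beta_{15.234}\,\alpha_{65.234}}{\omega_{26.34} - \beta_{65.234}\,\alpha_{65.234}} \\
\beta_{15} &= \beta_{15.234} - \beta_{16}\,\beta_{65.234} \\
\beta_{14} &= \beta_{14.23} - (\beta_{15}\,\beta_{54.23} + \beta_{16}\,\beta_{64.23}) \\
\beta_{13} &= \beta_{13.2} - (\beta_{14}\,\beta_{43.2} + \beta_{15}\,\beta_{53.2} + \beta_{16}\,\beta_{63.2}) \\
\beta_{12} &= r_{12} - (\beta_{13} r_{23} + \beta_{14} r_{24} + \beta_{15} r_{25} + \beta_{16} r_{26})
\end{aligned}
\]

\[
[\text{Check: } r_{13} = \beta_{12} r_{23} + \beta_{13} + \beta_{14} r_{34} + \beta_{15} r_{35} + \beta_{16} r_{36}]
\]

Multiple Correlation Coefficient

\[
R_{1.23456} = \sqrt{\beta_{12} r_{12} + \beta_{13} r_{13} + \beta_{14} r_{14} + \beta_{15} r_{15} + \beta_{16} r_{16}}
\]

Regression Equation

\[
\begin{aligned}
b_{12} &= \beta_{12}\,\sigma_1/\sigma_2, \quad b_{13} = \beta_{13}\,\sigma_1/\sigma_3, \quad b_{14} = \beta_{14}\,\sigma_1/\sigma_4, \quad b_{15} = \beta_{15}\,\sigma_1/\sigma_5, \quad b_{16} = \beta_{16}\,\sigma_1/\sigma_6 \\
K &= M_1 - (M_2 b_{12} + M_3 b_{13} + M_4 b_{14} + M_5 b_{15} + M_6 b_{16}) \\
X_1 &= b_{12} X_2 + b_{13} X_3 + b_{14} X_4 + b_{15} X_5 + b_{16} X_6 + K
\end{aligned}
\]

Standard Error of Estimate

\[
\sigma_{1.23456} = \sigma_1 \sqrt{1 - R_{1.23456}^2}
\]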


JOURNAL OF EXPERIMENTAL EDUCATION Volume I, No, 3 


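SEVEN-VARIABLE SCHEMA

Required Zero-Order Data

\[
M_1, M_2, M_3, M_4, M_5, M_6, M_7. \qquad \sigma_1, \sigma_2, \sigma_3, \sigma_4, \sigma_5, \sigma_6, \sigma_7.
\]

\[
r_{12}, r_{13}, r_{14}, r_{15}, r_{16}, r_{17}, \quad r_{23}, r_{24}, r_{25}, r_{26}, r_{27}, \quad r_{34}, r_{35}, r_{36}, r_{37}, \quad r_{45}, r_{46}, r_{47}, \quad r_{56}, r_{57}, r_{67}.
\]

First-Order Calculations

\[
\begin{aligned}
\alpha_{13.2} &= r_{13} - r_{12} r_{23} \\
\alpha_{14.2} &= r_{14} - r_{12} r_{24} \\
\alpha_{15.2} &= r_{15} - r_{12} r_{25} \\
\alpha_{16.2} &= r_{16} - r_{12} r_{26} \\
\alpha_{17.2} &= r_{17} - r_{12} r_{27} \\
\alpha_{43.2} &= r_{34} - r_{24} r_{23} \\
\alpha_{53.2} &= r_{35} - r_{25} r_{23} \\
\alpha_{54.2} &= r_{45} - r_{25} r_{24} \\
\alpha_{63.2} &= r_{36} - r_{26} r_{23} \\
\alpha_{64.2} &= r_{46} - r_{26} r_{24} \\
\alpha_{65.2} &= r_{56} - r_{26} r_{25} \\
\alpha_{73.2} &= r_{37} - r_{27} r_{23} \\
\alpha_{74.2} &= r_{47} - r_{27} r_{24} \\
\alpha_{75.2} &= r_{57} - r_{27} r_{25} \\
\alpha_{76.2} &= r_{67} - r_{27} r_{26} \\
\omega_{23} &= 1 - r_{23}^2 \\
\omega_{24} &= 1 - r_{24}^2 \\
\omega_{25} &= 1 - r_{25}^2 \\
\omega_{26} &= 1 - r_{26}^2 \\
\omega_{27} &= 1 - r_{27}^2 \\
\beta_{13.2} &= \alpha_{13.2} / \omega_{23} \\
\beta_{43.2} &= \alpha_{43.2} / \omega_{23} \\
\beta_{53.2} &= \alpha_{53.2} / \omega_{23} \\
\beta_{63.2} &= \alpha_{63.2} / \omega_{23} \\
\beta_{73.2} &= \alpha_{73.2} / \omega_{23}
\end{aligned}
\]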











H. D. Griffin 








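Second-Order Calculations

\[
\begin{aligned}
\alpha_{14.23} &= \alpha_{14.2} - \beta_{13.2}\,\alpha_{43.2} \\
\alpha_{15.23} &= \alpha_{15.2} - \beta_{13.2}\,\alpha_{53.2} \\
\alpha_{16.23} &= \alpha_{16.2} - \beta_{13.2}\,\alpha_{63.2} \\
\alpha_{17.23} &= \alpha_{17.2} - \beta_{13.2}\,\alpha_{73.2} \\
\alpha_{54.23} &= \alpha_{54.2} - \beta_{53.2}\,\alpha_{43.2} \\
\alpha_{64.23} &= \alpha_{64.2} - \beta_{63.2}\,\alpha_{43.2} \\
\alpha_{65.23} &= \alpha_{65.2} - \beta_{63.2}\,\alpha_{53.2} \\
\alpha_{74.23} &= \alpha_{74.2} - \beta_{73.2}\,\alpha_{43.2} \\
\alpha_{75.23} &= \alpha_{75.2} - \beta_{73.2}\,\alpha_{53.2} \\
\alpha_{76.23} &= \alpha_{76.2} - \beta_{73.2}\,\alpha_{63.2} \\
\omega_{24.3} &= \omega_{24} - \beta_{43.2}\,\alpha_{43.2} \\
\omega_{25.3} &= \omega_{25} - \beta_{53.2}\,\alpha_{53.2} \\
\omega_{26.3} &= \omega_{26} - \beta_{63.2}\,\alpha_{63.2} \\
\omega_{27.3} &= \omega_{27} - \beta_{73.2}\,\alpha_{73.2} \\
\beta_{14.23} &= \alpha_{14.23} / \omega_{24.3} \\
\beta_{54.23} &= \alpha_{54.23} / \omega_{24.3} \\
\beta_{64.23} &= \alpha_{64.23} / \omega_{24.3} \\
\beta_{74.23} &= \alpha_{74.23} / \omega_{24.3}
\end{aligned}
\]

Third-Order Calculations

\[
\begin{aligned}
\alpha_{15.234} &= \alpha_{15.23} - \beta_{14.23}\,\alpha_{54.23} \\
\alpha_{16.234} &= \alpha_{16.23} - \beta_{14.23}\,\alpha_{64.23} \\
\alpha_{17.234} &= \alpha_{17.23} - \beta_{14.23}\,\alpha_{74.23} \\
\alpha_{65.234} &= \alpha_{65.23} - \beta_{64.23}\,\alpha_{54.23} \\
\alpha_{75.234} &= \alpha_{75.23} - \beta_{74.23}\,\alpha_{54.23} \\
\alpha_{76.234} &= \alpha_{76.23} - \beta_{74.23}\,\alpha_{64.23} \\
\omega_{25.34} &= \omega_{25.3} - \beta_{54.23}\,\alpha_{54.23} \\
\omega_{26.34} &= \omega_{26.3} - \beta_{64.23}\,\alpha_{64.23} \\
\omega_{27.34} &= \omega_{27.3} - \beta_{74.23}\,\alpha_{74.23} \\
\beta_{15.234} &= \alpha_{15.234} / \omega_{25.34} \\
\beta_{65.234} &= \alpha_{65.234} / \omega_{25.34} \\
\beta_{75.234} &= \alpha_{75.234} / \omega_{25.34}
\end{aligned}
\]

Fourth-Order Calculations

\[
\begin{aligned}
\alpha_{16.2345} &= \alpha_{16.234} - \beta_{15.234}\,\alpha_{65.234} \\
\alpha_{17.2345} &= \alpha_{17.234} - \beta_{15.234}\,\alpha_{75.234} \\
\alpha_{76.2345} &= \alpha_{76.234} - \beta_{75.234}\,\alpha_{65.234} \\
\omega_{26.345} &= \omega_{26.34} - \beta_{65.234}\,\alpha_{65.234} \\
\omega_{27.345} &= \omega_{27.34} - \beta_{75.234}\,\alpha_{75.234} \\
\beta_{16.2345} &= \alpha_{16.2345} / \omega_{26.345} \\
\beta_{76.2345} &= \alpha_{76.2345} / \omega_{26.345}
\end{aligned}
\]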






JOURNAL OF EXPERIMENTAL EDUCATION ; Volume I, No, 3 


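Final Partial-Regression Coefficients

\[
\begin{aligned}
\beta_{17} &= \frac{\alpha_{17.2345} - \beta_{16.2345}\,\alpha_{76.2345}}{\omega_{27.345} - \beta_{76.2345}\,\alpha_{76.2345}} \\
\beta_{16} &= \beta_{16.2345} - \beta_{17}\,\beta_{76.2345} \\
\beta_{15} &= \beta_{15.234} - (\beta_{16}\,\beta_{65.234} + \beta_{17}\,\beta_{75.234}) \\
\beta_{14} &= \beta_{14.23} - (\beta_{15}\,\beta_{54.23} + \beta_{16}\,\beta_{64.23} + \beta_{17}\,\beta_{74.23}) \\
\beta_{13} &= \beta_{13.2} - (\beta_{14}\,\beta_{43.2} + \beta_{15}\,\beta_{53.2} + \beta_{16}\,\beta_{63.2} + \beta_{17}\,\beta_{73.2}) \\
\beta_{12} &= r_{12} - (\beta_{13} r_{23} + \beta_{14} r_{24} + \beta_{15} r_{25} + \beta_{16} r_{26} + \beta_{17} r_{27})
\end{aligned}
\]

\[
[\text{Check: } r_{13} = \beta_{12} r_{23} + \beta_{13} + \beta_{14} r_{34} + \beta_{15} r_{35} + \beta_{16} r_{36} + \beta_{17} r_{37}]
\]

Multiple Correlation Coefficient

\[
R_{1.234567} = \sqrt{\beta_{12} r_{12} + \beta_{13} r_{13} + \beta_{14} r_{14} + \beta_{15} r_{15} + \beta_{16} r_{16} + \beta_{17} r_{17}}
\]

Regression Equation

\[
\begin{aligned}
b_{12} &= \beta_{12}\,\sigma_1/\sigma_2, \quad b_{13} = \beta_{13}\,\sigma_1/\sigma_3, \quad b_{14} = \beta_{14}\,\sigma_1/\sigma_4 \\
b_{15} &= \beta_{15}\,\sigma_1/\sigma_5, \quad b_{16} = \beta_{16}\,\sigma_1/\sigma_6, \quad b_{17} = \beta_{17}\,\sigma_1/\sigma_7 \\
K &= M_1 - (M_2 b_{12} + M_3 b_{13} + M_4 b_{14} + M_5 b_{15} + M_6 b_{16} + M_7 b_{17}) \\
X_1 &= b_{12} X_2 + b_{13} X_3 + b_{14} X_4 + b_{15} X_5 + b_{16} X_6 + b_{17} X_7 + K
\end{aligned}
\]

Standard Error of Estimate

\[
\sigma_{1.234567} = \sigma_1 \sqrt{1 - R_{1.234567}^2}
\]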
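
Because the schemas repeat a few simple formula types at each order, the pattern can be written once for any number of variables. The Python sketch below is an illustration of that pattern, not a transcription of the printed worksheets: the function name `endothetic_betas` is hypothetical, the working array `a` holds the α's off the diagonal and the ω's on the diagonal, and the correlation matrix in the usage lines is invented.

```python
def endothetic_betas(r):
    """Partial-regression coefficients of variable 1 on variables 2..n.

    r is the full correlation matrix as a list of lists, with the
    dependent variable in row and column 0.
    """
    n = len(r)
    a = [[float(x) for x in row] for row in r]  # alphas (off-diagonal), omegas (diagonal)
    steps = []
    for k in range(1, n):                       # eliminate variables 2, 3, ..., n in turn
        omega = a[k][k]
        others = [0] + list(range(k + 1, n))
        beta = {i: a[i][k] / omega for i in others}   # beta = alpha / omega
        for i in others:
            for j in range(k + 1, n):
                a[i][j] -= beta[i] * a[j][k]          # next-order alpha or omega
        steps.append((k, beta))
    betas = [0.0] * n
    for k, beta in reversed(steps):             # final coefficients, last variable first
        betas[k] = beta[0] - sum(betas[j] * beta[j] for j in range(k + 1, n))
    return betas[1:]


# Invented five-variable correlation matrix for illustration.
r = [[1.0, 0.5, 0.4, 0.3, 0.2],
     [0.5, 1.0, 0.3, 0.3, 0.3],
     [0.4, 0.3, 1.0, 0.3, 0.3],
     [0.3, 0.3, 0.3, 1.0, 0.3],
     [0.2, 0.3, 0.3, 0.3, 1.0]]
b = endothetic_betas(r)
# Every normal equation should be reproduced, as in the printed checks.
print(all(abs(sum(b[j - 1] * r[i][j] for j in range(1, 5)) - r[0][i]) < 1e-12
          for i in range(1, 5)))
```

The printed check in each schema tests only the equation in r13; the loop above tests all of them, which costs nothing by machine.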





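March, 1933 H. D. Griffin


EIGHT-VARIABLE SCHEMA


Required Zero-Order Data

\[
M_1, M_2, M_3, M_4, M_5, M_6, M_7, M_8. \qquad \sigma_1, \sigma_2, \sigma_3, \sigma_4, \sigma_5, \sigma_6, \sigma_7, \sigma_8.
\]

\[
\begin{aligned}
& r_{12}, r_{13}, r_{14}, r_{15}, r_{16}, r_{17}, r_{18}, \quad r_{23}, r_{24}, r_{25}, r_{26}, r_{27}, r_{28}, \\
& r_{34}, r_{35}, r_{36}, r_{37}, r_{38}, \quad r_{45}, r_{46}, r_{47}, r_{48}, \quad r_{56}, r_{57}, r_{58}, \quad r_{67}, r_{68}, \quad r_{78}.
\end{aligned}
\]

First-Order Calculations

\[
\begin{aligned}
\alpha_{13.2} &= r_{13} - r_{12} r_{23} \\
\alpha_{14.2} &= r_{14} - r_{12} r_{24} \\
\alpha_{15.2} &= r_{15} - r_{12} r_{25} \\
\alpha_{16.2} &= r_{16} - r_{12} r_{26} \\
\alpha_{17.2} &= r_{17} - r_{12} r_{27} \\
\alpha_{18.2} &= r_{18} - r_{12} r_{28} \\
\alpha_{43.2} &= r_{34} - r_{24} r_{23} \\
\alpha_{53.2} &= r_{35} - r_{25} r_{23} \\
\alpha_{54.2} &= r_{45} - r_{25} r_{24} \\
\alpha_{63.2} &= r_{36} - r_{26} r_{23} \\
\alpha_{64.2} &= r_{46} - r_{26} r_{24} \\
\alpha_{65.2} &= r_{56} - r_{26} r_{25} \\
\alpha_{73.2} &= r_{37} - r_{27} r_{23} \\
\alpha_{74.2} &= r_{47} - r_{27} r_{24} \\
\alpha_{75.2} &= r_{57} - r_{27} r_{25} \\
\alpha_{76.2} &= r_{67} - r_{27} r_{26} \\
\alpha_{83.2} &= r_{38} - r_{28} r_{23} \\
\alpha_{84.2} &= r_{48} - r_{28} r_{24} \\
\alpha_{85.2} &= r_{58} - r_{28} r_{25} \\
\alpha_{86.2} &= r_{68} - r_{28} r_{26} \\
\alpha_{87.2} &= r_{78} - r_{28} r_{27} \\
\omega_{23} &= 1 - r_{23}^2 \\
\omega_{24} &= 1 - r_{24}^2 \\
\omega_{25} &= 1 - r_{25}^2 \\
\omega_{26} &= 1 - r_{26}^2 \\
\omega_{27} &= 1 - r_{27}^2 \\
\omega_{28} &= 1 - r_{28}^2 \\
\beta_{13.2} &= \alpha_{13.2} / \omega_{23} \\
\beta_{43.2} &= \alpha_{43.2} / \omega_{23} \\
\beta_{53.2} &= \alpha_{53.2} / \omega_{23} \\
\beta_{63.2} &= \alpha_{63.2} / \omega_{23} \\
\beta_{73.2} &= \alpha_{73.2} / \omega_{23} \\
\beta_{83.2} &= \alpha_{83.2} / \omega_{23}
\end{aligned}
\]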
























Second-Order Calculations 






\[
\begin{aligned}
\alpha_{1j.23} &= \alpha_{1j.2} - \beta_{13.2}\,\alpha_{j3.2} && (j = 4, 5, 6, 7, 8) \\
\alpha_{kj.23} &= \alpha_{kj.2} - \beta_{k3.2}\,\alpha_{j3.2} && (\text{every pair } j, k \text{ from } 4, \dots, 8,\ k \text{ the larger}) \\
\omega_{2j.3} &= \omega_{2j} - \beta_{j3.2}\,\alpha_{j3.2} && (j = 4, 5, 6, 7, 8) \\
\beta_{j4.23} &= \alpha_{j4.23} / \omega_{24.3} && (j = 1, 5, 6, 7, 8)
\end{aligned}
\]


Third-Order Calculations 










\[
\begin{aligned}
\alpha_{1j.234} &= \alpha_{1j.23} - \beta_{14.23}\,\alpha_{j4.23} && (j = 5, 6, 7, 8) \\
\alpha_{kj.234} &= \alpha_{kj.23} - \beta_{k4.23}\,\alpha_{j4.23} && (\text{every pair } j, k \text{ from } 5, \dots, 8,\ k \text{ the larger}) \\
\omega_{2j.34} &= \omega_{2j.3} - \beta_{j4.23}\,\alpha_{j4.23} && (j = 5, 6, 7, 8) \\
\beta_{j5.234} &= \alpha_{j5.234} / \omega_{25.34} && (j = 1, 6, 7, 8)
\end{aligned}
\]


















Fourth-Order Calculations 






\[
\begin{aligned}
\alpha_{1j.2345} &= \alpha_{1j.234} - \beta_{15.234}\,\alpha_{j5.234} && (j = 6, 7, 8) \\
\alpha_{kj.2345} &= \alpha_{kj.234} - \beta_{k5.234}\,\alpha_{j5.234} && (kj = 76,\ 86,\ 87) \\
\omega_{2j.345} &= \omega_{2j.34} - \beta_{j5.234}\,\alpha_{j5.234} && (j = 6, 7, 8) \\
\beta_{j6.2345} &= \alpha_{j6.2345} / \omega_{26.345} && (j = 1, 7, 8)
\end{aligned}
\]






Fifth-Order Calculations 






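\[
\begin{aligned}
\alpha_{17.23456} &= \alpha_{17.2345} - \beta_{16.2345}\,\alpha_{76.2345} \\
\alpha_{18.23456} &= \alpha_{18.2345} - \beta_{16.2345}\,\alpha_{86.2345} \\
\alpha_{87.23456} &= \alpha_{87.2345} - \beta_{86.2345}\,\alpha_{76.2345} \\
\omega_{27.3456} &= \omega_{27.345} - \beta_{76.2345}\,\alpha_{76.2345} \\
\omega_{28.3456} &= \omega_{28.345} - \beta_{86.2345}\,\alpha_{86.2345} \\
\beta_{17.23456} &= \alpha_{17.23456} / \omega_{27.3456} \\
\beta_{87.23456} &= \alpha_{87.23456} / \omega_{27.3456}
\end{aligned}
\]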









Final Partial-Regression Coefficients 










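\[
\begin{aligned}
\beta_{18} &= \frac{\alpha_{18.23456} - \beta_{17.23456}\,\alpha_{87.23456}}{\omega_{28.3456} - \beta_{87.23456}\,\alpha_{87.23456}} \\
\beta_{17} &= \beta_{17.23456} - \beta_{18}\,\beta_{87.23456} \\
\beta_{16} &= \beta_{16.2345} - (\beta_{17}\,\beta_{76.2345} + \beta_{18}\,\beta_{86.2345}) \\
\beta_{15} &= \beta_{15.234} - (\beta_{16}\,\beta_{65.234} + \beta_{17}\,\beta_{75.234} + \beta_{18}\,\beta_{85.234}) \\
\beta_{14} &= \beta_{14.23} - (\beta_{15}\,\beta_{54.23} + \beta_{16}\,\beta_{64.23} + \beta_{17}\,\beta_{74.23} + \beta_{18}\,\beta_{84.23}) \\
\beta_{13} &= \beta_{13.2} - (\beta_{14}\,\beta_{43.2} + \beta_{15}\,\beta_{53.2} + \beta_{16}\,\beta_{63.2} + \beta_{17}\,\beta_{73.2} + \beta_{18}\,\beta_{83.2}) \\
\beta_{12} &= r_{12} - (\beta_{13} r_{23} + \beta_{14} r_{24} + \beta_{15} r_{25} + \beta_{16} r_{26} + \beta_{17} r_{27} + \beta_{18} r_{28})
\end{aligned}
\]

Check: $r_{13} = \beta_{12} r_{23} + \beta_{13} + \beta_{14} r_{34} + \beta_{15} r_{35} + \beta_{16} r_{36} + \beta_{17} r_{37} + \beta_{18} r_{38}$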










Multiple Correlation Coefficient 


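\[
R_{1.2345678} = \sqrt{\beta_{12} r_{12} + \beta_{13} r_{13} + \beta_{14} r_{14} + \beta_{15} r_{15} + \beta_{16} r_{16} + \beta_{17} r_{17} + \beta_{18} r_{18}}
\]

Regression Equation

\[
\begin{aligned}
b_{12} &= \beta_{12}\,\sigma_1/\sigma_2 & b_{16} &= \beta_{16}\,\sigma_1/\sigma_6 \\
b_{13} &= \beta_{13}\,\sigma_1/\sigma_3 & b_{17} &= \beta_{17}\,\sigma_1/\sigma_7 \\
b_{14} &= \beta_{14}\,\sigma_1/\sigma_4 & b_{18} &= \beta_{18}\,\sigma_1/\sigma_8 \\
b_{15} &= \beta_{15}\,\sigma_1/\sigma_5
\end{aligned}
\]

\[
\begin{aligned}
K &= M_1 - (M_2 b_{12} + M_3 b_{13} + M_4 b_{14} + M_5 b_{15} + M_6 b_{16} + M_7 b_{17} + M_8 b_{18}) \\
X_1 &= b_{12} X_2 + b_{13} X_3 + b_{14} X_4 + b_{15} X_5 + b_{16} X_6 + b_{17} X_7 + b_{18} X_8 + K
\end{aligned}
\]

Standard Error of Estimate

\[
\sigma_{1.2345678} = \sigma_1 \sqrt{1 - R^2_{1.2345678}}
\]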


A study of the schemas will reveal the general plan for expanding these
endothetic formulas to care for additional variables. The use of small
capital letters for subscripts with ten or more variables, as is the practice
among many agricultural research workers, is to be recommended as less
confusing than numerals.
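
The general plan behind the schemas can be sketched in a few lines of code. The following is a minimal illustration, not part of the original article: the function name `griffin_betas` and the list-of-lists layout of the correlation matrix are assumptions made here. It raises the order of the alpha and omega terms one variable at a time, stores each partial beta, and then back-substitutes exactly as in the "Final Partial-Regression Coefficients" section.

```python
from math import sqrt


def griffin_betas(r):
    """Return [unused, beta_12, beta_13, ..., beta_1p] for correlation matrix r.

    r[0] is the criterion (variable 1), r[1] is variable 2, and so on.
    (Hypothetical helper; the name and data layout are illustrative.)
    """
    p = len(r)
    # First order: alpha_{ij.2} = r_ij - r_2i * r_2j; the diagonal holds omega_2j.
    a = [[r[i][j] - r[1][i] * r[1][j] for j in range(p)] for i in range(p)]
    partial = {}  # partial[(j, k)] = beta_{jk.23...} at the order where k is the pivot
    for k in range(2, p):                    # pivots: variables 3, 4, ..., p
        rows = [0] + list(range(k + 1, p))   # variable 1 and every variable after the pivot
        for j in rows:
            partial[(j, k)] = a[j][k] / a[k][k]        # beta = alpha / omega
        nxt = [row[:] for row in a]
        for i in rows:
            for j in range(k + 1, p):        # raise alpha and omega by one order
                nxt[i][j] = a[i][j] - partial[(i, k)] * a[j][k]
        a = nxt
    beta = [0.0] * p
    for k in range(p - 1, 1, -1):            # back-substitution, last variable first
        beta[k] = partial[(0, k)] - sum(
            beta[j] * partial[(j, k)] for j in range(k + 1, p)
        )
    beta[1] = r[0][1] - sum(beta[j] * r[1][j] for j in range(2, p))
    return beta


def multiple_r(r, beta):
    """Multiple correlation: square root of the sum of beta_1j * r_1j."""
    return sqrt(sum(beta[j] * r[0][j] for j in range(1, len(r))))


# Illustrative four-variable correlation matrix (made-up values).
example_r = [
    [1.0, 0.5, 0.4, 0.3],
    [0.5, 1.0, 0.2, 0.1],
    [0.4, 0.2, 1.0, 0.3],
    [0.3, 0.1, 0.3, 1.0],
]
example_beta = griffin_betas(example_r)
```

The schema's own "Check" line is the natural test of such a routine: each criterion correlation should be reproduced by the betas and the intercorrelations of the remaining variables.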




A FREQUENCY-PRODUCT METHOD OF OBTAINING THE STANDARD DEVIATION 





Herbert A. Toops 


If the several frequencies of a distribution
table be given literal designations, an
interesting formula for obtaining the
standard deviation results.

Let the distribution and the extensions of
steps be as given in the following table:























Marks | X (Point Value) | Frequency (F) | X·F          | X²·F
A     | 4               | a             | 4a           | 16a
B     | 3               | b             | 3b           | 9b
C     | 2               | c             | 2c           | 4c
D     | 1               | d             | d            | d
E     | 0               | e             | 0            | 0
Sums, and designation   | N = a+b+c+d+e | ΣX = 4a+3b+2c+d | ΣX² = 16a+9b+4c+d





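If in the usual gross score formula for the
standard deviation,

\[
\sigma_x = \sqrt{\frac{N\Sigma X^2 - (\Sigma X)^2}{N^2}} = \sqrt{\frac{L_{xx}}{N^2}} \tag{1}
\]

we substitute the above found literal
equivalences, namely,

\[
\begin{aligned}
N &= a + b + c + d + e && (2) \\
\Sigma X^2 &= 16a + 9b + 4c + d && (3) \\
\Sigma X &= 4a + 3b + 2c + d && (4)
\end{aligned}
\]

we have,

\[
\sigma_x = \sqrt{\frac{(a+b+c+d+e)(16a+9b+4c+d) - (4a+3b+2c+d)^2}{N^2}} \tag{5}
\]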
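After expansion and simplification, we have

\[
\sigma_x = \sqrt{\frac{\begin{aligned} &ab + 4ac + 9ad + 16ae \\ &+ bc + 4bd + 9be \\ &+ cd + 4ce \\ &+ de \end{aligned}}{N^2}}
= \sqrt{\frac{\begin{aligned} &a(b + 4c + 9d + 16e) \\ &+ b(c + 4d + 9e) \\ &+ c(d + 4e) \\ &+ d(e) \end{aligned}}{N^2}} \tag{6}
\]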
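
The step from (5) to (6) is an instance of a general identity, sketched here as a supplement; the symbols $f_i$ and $X_i$ for the frequency and point value of the $i$-th class are notation introduced for this note and do not appear in the original.

\[
N\sum_i f_i X_i^2 - \Big(\sum_i f_i X_i\Big)^2
= \sum_i\sum_j f_i f_j X_i^2 - \sum_i\sum_j f_i f_j X_i X_j
= \tfrac{1}{2}\sum_i\sum_j f_i f_j (X_i - X_j)^2
= \sum_{\text{pairs } i,j} f_i f_j (X_i - X_j)^2 .
\]

When the classes are one step apart, $(X_i - X_j)^2$ takes only the values 1, 4, 9, 16, and so on, which is why the coefficients in (6) are the successive perfect squares.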


The result is curious in that the numerical
coefficients are a series of perfect squares
of the numbers 1, 2, 3, 4, etc. By careful


Ohio State University 











inspection this will be found to be equivalent
to the following job-analysis steps of
a concrete problem. Let the problem be the
solution of σ_x for the following distribution:

Mark | Frequency | Stencil
A    |   5       | (arrow)
B    |  22       |  1
C    |  48       |  4
D    |  17       |  9
E    |   8       | 16
     | 100       |


The steps in the solution are:

1. Place a stencil such as shown above, in
   position as shown.

2. Employing columns 1 and 2 of the keyboard
   of a calculating machine and windows
   1, 2, 3 of the product dial, repeat
   key depressed, dials cleared, accumulate
   the sum of the frequencies times
   their respective stencil multipliers,
   namely, (22×1) + (48×4) + (17×9) + (8×16)
   = 495. It is necessary to clear the
   multiplier (upper) dial after each
   multiplication while allowing the several
   products to accumulate in the product dial.

3. Set 495 in columns 1, 2 and 3 of the keyboard;
   crank negatively once (thus clearing
   out the lower dial to zero); clear
   the multiplier (upper) dial; multiply by
   the frequency opposite the arrow, namely
   5, thus yielding 2475 in the product dial.

4. Set 2475 in columns 1, 2, 3, 4 of the keyboard;
   crank negatively once, thus clearing
   the product dial to zero; move the
   carriage to the extreme right as far as
   it will go; add 2475 into the product
   dial; place a decimal pointer on the product
   dial to indicate that the sum is 2475
   and not 24750 or 247500; move the carriage
   back to the extreme left as far as
   it will go. Clear the multiplier dial.


5. Pull the stencil down one step into the
   following position:

   Mark | Frequency | Stencil
   B    |  22       | (arrow)
   C    |  48       |  1
   D    |  17       |  4
   E    |   8       |  9
        |           | (16)


6. Accumulate in windows 1, 2, 3 of the
   product dial the sum of the frequencies
   times their respective (new) stencil
   multipliers, namely: (48×1) + (17×4) +
   (8×9) = 188.

7. Set 188 in columns 1, 2, 3 of the keyboard;
   crank negatively once; clear the multiplier
   dial; multiply by the frequency opposite
   the arrow of the stencil, namely 22,
   obtaining 4136; set 4136 in columns 1, 2,
   3, 4 of the keyboard; crank negatively once,
   clearing the right-most four windows of the
   product dial to zero; move the carriage to
   the extreme right as far as it will go;
   add the 4136 in the keyboard to the 2475
   already there, thus obtaining (2475 +
   4136) = 6611. Clear the multiplier dial.

8. Pull the stencil down one step into the
   following position:

   Mark | Frequency | Stencil
   C    |  48       | (arrow)
   D    |  17       |  1
   E    |   8       |  4
9. Repeat, analogously, steps 6 and 7; thus
   [(17×1) + (8×4)] × 48 = 2352, which is
   likewise added to 6611, making 8963.

10. Pull the stencil down into the final
    position:

    Mark | Frequency | Stencil
    D    |  17       | (arrow)
    E    |   8       |  1

11. This operation nets us [(8×1)] × 17 =
    136, which, added to the previous 8963,
    yields 9099 = L_xx, whence

\[
\sigma_x = \sqrt{\frac{L_{xx}}{N^2}} = \sqrt{\frac{9099}{10000}} = \sqrt{.9099} = .954
\]
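
The stencil procedure above can be sketched in code as a check on the arithmetic. This is an illustration only; the function names are chosen here and are not the author's. Each pass of the outer loop corresponds to one stencil position: the squares 1, 4, 9, ... are laid against the classes below the arrow, the products are accumulated, and the sum is multiplied by the frequency opposite the arrow.

```python
from math import sqrt


def frequency_product(freqs):
    """Accumulate f_i * (1*f_{i+1} + 4*f_{i+2} + 9*f_{i+3} + ...) over all classes."""
    total = 0
    for i, f_i in enumerate(freqs):
        # One stencil position: squares 1, 4, 9, ... against the classes after f_i.
        inner = sum(f_j * (step + 1) ** 2 for step, f_j in enumerate(freqs[i + 1:]))
        total += f_i * inner
    return total


def sigma_from_frequencies(freqs):
    """Standard deviation in step units: square root of the frequency product over N."""
    return sqrt(frequency_product(freqs)) / sum(freqs)


freqs = [5, 22, 48, 17, 8]  # marks A to E from the worked example
l_xx = frequency_product(freqs)
sigma = sigma_from_frequencies(freqs)
```

Reversing the list of frequencies leaves the frequency product unchanged, since each pair of classes contributes the same squared distance either way; this is the basis of the inverted-stencil check.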




We may check the value of L_xx = 9099
by using an inverted stencil, corresponding
to the formula

\[
L_{xx} = a'(b' + 4c' + 9d' + 16e') + b'(c' + 4d' + 9e') + c'(d' + 4e') + d'(e')
\]

developed from the system,

Frequency
Literal | Numerical
e'      |  5
d'      | 22
c'      | 48
b'      | 17
a'      |  8

in which method we raise the stencil (now reading
16, 9, 4, 1 from the top, with the arrow at the
bottom) one unit after each set of multiplications.
To save space, the four accumulations, one for
each of the four successive positions of the
stencil, are simply indicated:

(1) [(5×16) + (22×9) + (48×4) + (17×1)] × 8
(2) [(5×9) + (22×4) + (48×1)] × 17
(3) [(5×4) + (22×1)] × 48
(4) [(5×1)] × 22

the sum of which is L_xx.





Obviously the method can be extended to larger
distributions simply by lengthening the
stencil; also, it is clear that the operations
may be carried through almost as automatisms,
once the method is learned, in a
small fraction of the time here taken to tell
how to do the operations.

If the column above called X were step
scores, the obtained σ must be multiplied by
the class interval to obtain the true σ.

There is no essential reason why the frequency
column should be stationary and the
stencil movable; in fact, quite the reverse
arrangement is advantageous, for then we may
employ the stencil of Figure 1.





[Figure 1. Composite Stencil With Movable Frequencies. A ruled grid whose columns carry the successive squares 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, beside a slip with movable frequencies a, b, c, ..., k.]

1. On a mimeographed sheet, such as the above
   (Figure 1), replace the right-hand
   literal frequencies by the numerical, placing
   the frequency of the largest class at a.

2. Copy the left-hand marginal frequencies
   a, b, c, ..., k of the left-hand margin on
   a sheet of paper with the same horizontal
   rulings as the stencil.

3. With the frequencies in position as shown,
   perform the multiplications indicated by
   multiplying the frequencies by the
   compartmental stencil entries of the first
   column, allowing the products to accumulate;
   then multiply the sum by a and transfer
   to a storage window.

4. Move the frequencies to the right, at
   right angles, until the column headed a is
   covered by the frequency sheet; and repeat.

5. So continue until multiplications are no
   longer possible. The stored sum is L_xx.

6. For a check, write the frequencies in
   inverted order, placing instead the
   frequency of the lowest class at a, with
   higher classes ranging down. Repeat.

Obviously the method becomes cumbersome
with more than a few classes. It may have
useful applications in designing calculating
machinery.




















VALIDATION AGAINST A FALLIBLE CRITERION

Edward E. Cureton

Alabama Polytechnic Institute
Auburn, Alabama


INTRODUCTION 


The problems with which this paper is 
concerned arose in attempting to validate a
battery of tests against the criterion of 
success in college. Validity is ordinarily 
defined as the success with which a test 
measures what it sets out to measure. In 
validating a battery of tests, we must, there-
fore, obtain the equation of multiple regres-
sion of the criterion upon the several tests.
The regression coefficient of each test is
its weighting factor in the battery, and the 
multiple correlation coefficient (or its 
square) is the statistical coefficient of 
practical validity. 

Success in school or college is contin- 
gent upon a multiplicity of factors. It de- 
pends, among others, upon the intelligence 
and interest of the student, the amount of 
studying he does, the scholastic standards 
of his particular instructors, the validity 
and reliability of the examinations and 
tests he takes, and the number and variety 
of outside activities in which he engages. 
Some of these factors are relatively gener- 
al and permanent, while others are tempora- 
ry or irrelevant to the student's underly- 
ing abilities, attitudes, habits, and in- 
terests. The test or battery of tests 
should be validated against these systemat- 
ic factors in the student; freed if possible 
from the disturbing effects of the unrelia- 
bility of the marking system and even of 
those fluctuations in attitude and applica- 
tion that cause a student to do better or 
worse work at one time than at another. It 
is necessary, therefore, to determine the 
reliability of available measures of school 
and college success, and to investigate the 
multiple regression equation and the multi- 
ple correlation coefficient when these are 
corrected for the unreliability or attenua-
tion in the criterion.





School or college success is ordinarily 
measured by the average percentage, if grades 
are reported in percentage terms. Each per- 
centage, in averaging, is weighted by the 
number of credit-hours of the course. In 
schools that mark on a five-point scale or 
some similar device, the point-hour ratio is 
employed. Each grade of A, say, is multi- 
plied by 4 times the number of credit-hours 
of the course in which it is received, each 
B by 3, each C by 2, each D by 1, and each 
failure by 0; and the sum of these is divid- 
ed by the total number of credit-hours for 
which the student is registered. An average 
percentage or point-hour ratio is worked out 
for each scholastic quarter or semester of 
attendance, and these term-averages are then 
themselves averaged to give a measure of to- 
tal school success. In getting the final av- 
erage, there is usually no attempt made to 
weight the successive term-averages by the 
number of credit-hours. This procedure gives 
a final average that is a reasonably valid 
measure of school success, but is not per- 
fectly reliable. The averages for succes- 
sive terms supply the necessary data for es- 
timating this reliability. 
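
The point-hour ratio described above can be sketched as follows. This is an illustration under stated assumptions: the function names are chosen here, and the letter "F" for a failure is an assumption (the text says only that each failure counts 0).

```python
# Grade points per credit-hour as given in the text; "F" is an assumed label for failure.
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


def point_hour_ratio(courses):
    """Term-average from (grade, credit_hours) pairs for one term."""
    total_hours = sum(hours for _, hours in courses)
    points = sum(GRADE_POINTS[grade] * hours for grade, hours in courses)
    return points / total_hours


def scholastic_average(term_ratios):
    """Unweighted mean of the term-averages, as in the usual practice described."""
    return sum(term_ratios) / len(term_ratios)


# Hypothetical student: two terms of coursework.
term_1 = point_hour_ratio([("A", 3), ("C", 3)])
term_2 = point_hour_ratio([("B", 4), ("F", 2)])
overall = scholastic_average([term_1, term_2])
```

Note that the final average weights each term equally regardless of credit-hours carried, which is the practice the text describes.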

The reliability coefficient of a test is
usually defined as the correlation between
that test and another one of similar nature
but different subject-matter. More generally,
the reliability of any set of measures
may be defined as the ratio of the variance
(squared standard deviation) of the systematic
factors or abilities underlying them, to
the variance of the measures themselves. The
writer has shown elsewhere¹ that if the two
forms of a test are equally reliable, the
reliability of either form, defined as above,
is equal to the correlation between the two
forms. In the case of measures of school
success, however, we do not have any natural
grouping of the measures into "forms," and
[any] attempt to [group] them artificial[ly would]
lead to spurious correlation. Thus, for example,
averages usually improve slightly as
the student progresses through school; some
students, due to outside activities, do better
work regularly in the fall or in the
spring; and large numbers of students take
courses that run through two or more successive
terms with the same instructors. Furthermore,
great numbers of students drop out
of school before finishing, and since there
is known to be a positive correlation between
scholastic average and persistence in
school, the criterion groups for many important
studies would be unwarrantably restricted
in range if we included in them only
those students who graduated. The reliability
of the average of a larger number of
term-averages is higher than that of a smaller
number, so that the reliability of a criterion
group will be unequally contributed
to by its various members.

1. Edward E. Cureton, "Errors of Measurement and Correlation," [...], pp. [...].


RELIABILITY OF A SCHOLASTIC AVERAGE 


We shall assume, in the following discussion,
that the term-average (average percentage
or point-hour ratio) obtained by a student
in one term is comparable in general to
that obtained in any other term. This means
that the true underlying abilities (including
fundamental work habits, attitudes, and
interests as well as scholastic intelligence)
remain substantially the same from term to
term; and that essentially the same abilities
are required for success in one term as in
another. This last assumption is obviously
not literally true, since students as they
progress through school take one sort of work
in one term and a different sort in another.
It is probably not in error to an extent
sufficient to invalidate our argument, but
even this is a matter of conjecture. We must
also assume that the errors of measurement
or variations in the successive term-averages
are uncorrelated with one another and
with the true underlying abilities; i.e.,
that a student who makes a high (for him)
average in one term is not thereby likely to
make either a high or a low average in any
other given term, and that one who makes a
high general average (as compared with others)
is not likely on that account to be
either more or less variable from term to
term than one who makes a low general average.
We must further assume that the errors
of measurement are neither greater nor less
in the earlier terms of a scholastic career
than in the later ones. These assumptions
also are not literally true. The first will
be violated to the extent of the "halo effect"
whenever a student takes the same subject
for two or more successive terms under
the same instructor. Inconclusive evidence
would seem to favor the view that good students
are a little less variable from term
to term than are poor students, and that students
are less variable in later terms than
in earlier ones. We can again only hope
(though again with some probability, to be
sure, that we are correct) that these deficiencies
are not great enough to invalidate
the essential argument.

Let $x_a, x_b, \dots, x_n$ represent the successive
term-averages of an individual. The term-average,
throughout the following treatment,
will be the unit measure, corresponding to
the score on one form of a test. Let $x_\infty$ be
the true measure of scholastic success, or
the underlying ability of the student. Let
$\delta_a, \delta_b, \dots, \delta_n$ be the chance factors or
errors of measurement in the successive term-averages
of the student. All the above defined
measures are to be taken as deviations
from their respective means for the group as
origins. We have then, for n term-averages
of an individual student,

\[
\begin{aligned}
x_a &= x_\infty + \delta_a \\
x_b &= x_\infty + \delta_b \\
&\;\;\vdots \\
x_n &= x_\infty + \delta_n
\end{aligned}
\tag{1}
\]

From the assumption that the true underlying
ability remains the same from term to term,
we have been able to use the same $x_\infty$ in
each of these equations. The absence of
weighting factors for $x_\infty$ implies the assumption
that the units of measurement (scholastic
standards and marking systems) remain
the same from term to term.

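From equations (1) and the assumption that
the errors of measurement ($\delta$) are uncorrelated
with the true abilities ($x_\infty$), we have for
the groups of term-averages representing
successive terms

\[
\begin{aligned}
\sigma_a^2 &= \sigma_\infty^2 + \sigma_{\delta_a}^2 \\
\sigma_b^2 &= \sigma_\infty^2 + \sigma_{\delta_b}^2 \\
&\;\;\vdots \\
\sigma_n^2 &= \sigma_\infty^2 + \sigma_{\delta_n}^2
\end{aligned}
\tag{2}
\]

From the assumption that the errors of measurement
are of equal magnitude in successive
terms,

\[
\sigma_{\delta_a}^2 = \sigma_{\delta_b}^2 = \dots = \sigma_{\delta_n}^2 = \sigma_\delta^2, \text{ say,}
\]

and since $x_\infty$ is assumed to be the same
throughout,

\[
\sigma_a^2 = \sigma_b^2 = \dots = \sigma_n^2 = \sigma^2, \text{ say.}
\]

Equations (2) may then be combined into the
single equation,

\[
\sigma^2 = \sigma_\infty^2 + \sigma_\delta^2 \tag{3}
\]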


Now according to our assumptions and definitions,
$x_\infty$ includes all causes of common
variation in $x_a, x_b, \dots, x_n$, and $\delta_a, \delta_b, \dots, \delta_n$
include all causes of variation unique
to their respective measures. If for
any individual, therefore, we have two or
more measures, the variance of these measures
($n_i$ in number for the i-th individual)
will be an estimate of the variance for him
of the errors of measurement. The subscript
i indicates that all values of a variable
having this subscript represent measures of
the same individual. Letting S be a summation
from 1 to $n_i$, we have then as an estimate
of the variance of the errors of measurement
of the i-th individual,

\[
\sigma_{\delta_i}^2 = \frac{S(X_i - \bar{X}_i)^2}{n_i - 1} \tag{4}
\]

This is the ordinary formula for any variance,
expressed in terms of raw scores. $X_i$
is a raw score (term-average), and $\bar{X}_i$ is the
mean of all such scores ($n_i$ in number) for
the i-th individual. It is important, since
$n_i$ is usually small, that we take $n_i - 1$ as
the denominator rather than simply $n_i$.¹ From
the ordinary formula for a mean, $\bar{X}_i = SX_i / n_i$.




Substituting in (4) and reducing,

\[
\sigma_{\delta_i}^2 = \frac{SX_i^2}{n_i - 1} - \frac{(SX_i)^2}{n_i(n_i - 1)}
\]

This formula gives an estimate of the variance
of the errors of measurement for one
individual (the i-th). It is applicable only
to those individuals, in a group of N,
say, for whom two or more measures (term-averages)
are available. Suppose there are N'
such individuals. Our estimate of $\sigma_\delta^2$, the
variance of the errors of measurement of all
the [group], will be the mean of the N'
values of $\sigma_{\delta_i}^2$. In computing this mean, each
value of $\sigma_{\delta_i}^2$ should be weighted by $n_i$, the
number of measures on which it is based. Let
$\Sigma'$ be a summation from 1 to N'. Then

\[
\sigma_\delta^2 = \frac{\Sigma'\, n_i\, \sigma_{\delta_i}^2}{\Sigma'\, n_i} \tag{5}
\]

Substituting in (5) the value of $\sigma_{\delta_i}^2$ found
above, and reducing,

\[
\sigma_\delta^2 = \frac{\Sigma' \left[ \dfrac{n_i\, SX_i^2}{n_i - 1} - \dfrac{(SX_i)^2}{n_i - 1} \right]}{\Sigma'\, n_i} \tag{6}
\]

The reliability of the single term-average
may be written, according to our definition
as the ratio of the variance of the
true measures or underlying abilities to the
variance of the obtained fallible measures,

\[
R_1 = \frac{\sigma_\infty^2}{\sigma^2}
\]

The symbol R denotes a reliability coefficient,
and the subscript 1 indicates that it
is the reliability of the single term-average.
From (3),

\[
R_1 = 1 - \frac{\sigma_\delta^2}{\sigma^2} \tag{7}
\]


In this equation, $\sigma_\delta^2$ may be obtained from
(6). We shall assume that the reliability
of a term-average is in general the same for
those who remain in school only one term as
for those who remain two or more terms. In
computing $\sigma^2$, we shall therefore consider
all members of the group (N in number), rather
than only those for whom two or more measures
are available (N' in number), as was
necessary in estimating the value of $\sigma_\delta^2$. Let
$\Sigma$ be a summation from 1 to N. The best
estimate of $\sigma^2$ would be the average of the
variances for successive terms,

\[
\sigma_a^2 = \frac{\Sigma X_a^2}{N_a - 1} - \frac{(\Sigma X_a)^2}{N_a (N_a - 1)} \tag{8}
\]

The subscript a indicates that we are dealing
only with term-averages for the first
term. In like manner, we could obtain the
variance for each successive term. Not all
of the N subjects would be present in any
one term, and hence $N_a$ would not equal $N_b$,
$N_c, \dots, N_n$. In obtaining $\sigma^2$, we should weight
each value, $\sigma_a^2, \sigma_b^2, \dots, \sigma_n^2$ by $N_a, N_b, \dots, N_n$.
Then,

\[
\sigma^2 = \frac{N_a \sigma_a^2 + N_b \sigma_b^2 + \dots + N_n \sigma_n^2}{N_a + N_b + \dots + N_n} \tag{9}
\]

1. R. A. Fisher, Statistical Methods for Research Workers (Second Edition. London: Oliver and Boyd, 1928), pp. 51-52.





If, as is often the case, the mean term-average
in successive terms does not change,
we may calculate $\sigma^2$ as the variance of all
the term-averages of all the students. In
this case we have simply,

\[
\sigma^2 = \frac{\Sigma SX_i^2}{\Sigma n_i} - \left[ \frac{\Sigma SX_i}{\Sigma n_i} \right]^2 \tag{10}
\]

To obtain the reliability of the single
term-average, we compute $\sigma_\delta^2$ from (6) and $\sigma^2$
from (8) and (9) or from (10), and substitute
in (7). This value may be important in
certain theoretical discussions, but the
reliability coefficient that is usually desired
in practice is the reliability of the
total scholastic average. This total scholastic
average, for any individual, is the
mean of his several term-averages ($n_i$ in
number). Let $\Delta$ be the error of measurement
in this total scholastic average, and let R
be the reliability of the total scholastic
average for a group. We then have as an
extension of (7),

\[
R = 1 - \frac{\sigma_\Delta^2}{\sigma_{\bar{X}}^2} \tag{11}
\]


The value $\Delta$, it should be noted, is a mean:
the mean, for each individual, of the $n_i$ values
of $\delta_i$. Hence $\sigma_\Delta^2$ may be estimated for each
individual; it is the sampling variance of
the mean of the $n_i$ values of $\delta_i$. By the usual
formula for the sampling variance of a
mean,

\[
\sigma_{\Delta_i}^2 = \frac{\sigma_\delta^2}{n_i} \tag{12}
\]

The value $\sigma_{\Delta_i}^2$ is an estimate of the sampling
variance of the total scholastic average of
the i-th individual. In making this estimate,
$\sigma_\delta^2$ is to be preferred to $\sigma_{\delta_i}^2$ as an
estimate of the variance of the infinite population
of theoretical errors of measurement
in successive term-averages.

For the group of N individuals, $\sigma_\Delta^2$ will be
the mean of the N values of $\sigma_{\Delta_i}^2$,

\[
\sigma_\Delta^2 = \frac{N \sigma_\delta^2}{\Sigma n_i} \tag{13}
\]

The value of $\sigma_{\bar{X}}^2$ in (11) should not be
confused with the sampling variance of the
mean. It is the variance of the mean term-averages
of N different individuals, not of
N sets or of an infinite number of sets of
mean term-averages of the same individual.
It is to be obtained by direct computation
from the data, taking care of course to weight
each value of $\bar{X}_i$ by $n_i$. We have therefore,

\[
\sigma_{\bar{X}}^2 = \frac{\Sigma\, n_i (\bar{X}_i - \bar{X})^2}{\Sigma n_i}
\]

The value $\bar{X}$ is the mean of all the term-averages
of all the subjects. There will be $\Sigma n_i$
such term-averages all together. Substituting
for $\bar{X}_i$ and $\bar{X}$ their values in terms of
raw scores and reducing,

\[
\sigma_{\bar{X}}^2 = \frac{\Sigma \left[ \dfrac{(SX_i)^2}{n_i} \right]}{\Sigma n_i} - \left[ \frac{\Sigma SX_i}{\Sigma n_i} \right]^2 \tag{14}
\]


To obtain R, the reliability of the total
scholastic average, we compute $\sigma_\delta^2$ from (6),
using only those individuals (N' in number)
for whom two or more measures are available.
We then obtain $\sigma_\Delta^2$ from (13), $\sigma_{\bar{X}}^2$ from (14),
and finally R from (11).

The methods of computation may best be
illustrated by a concrete example. Figure I
shows the point-hour ratios (term-averages)
in successive terms of 9 students, and the
computations leading to the estimation of R.
One would never, of course, compute a coefficient
of reliability from as few as 9 cases.
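
The computation of R can be sketched in code as a check on formulas (6), (13), (14) and (11). The term-averages below are transcribed from Figure I, with student A's first ratio read as 1.9 so that the row agrees with the printed sums (an assumption about the scan); the function names are chosen for this illustration.

```python
# Term-averages of the nine students of Figure I, in the order recorded.
term_averages = [
    [1.9, 1.6, 2.0, 2.0],  # A
    [2.4, 2.4, 3.2, 2.8],  # B
    [1.7, 1.9, 1.9],       # C
    [3.1, 3.6, 3.5, 3.4],  # D
    [2.4],                 # E (one measure only; excluded from the primed sums)
    [2.1, 1.9],            # F
    [1.9, 1.4, 2.4, 2.0],  # G
    [3.1, 2.7, 3.3],       # H
    [3.1, 3.5, 3.5, 3.6],  # I
]


def error_variance(records):
    """Formula (6): pooled variance of the errors of measurement."""
    multi = [x for x in records if len(x) >= 2]
    numerator = sum(
        (len(x) * sum(v * v for v in x) - sum(x) ** 2) / (len(x) - 1) for x in multi
    )
    return numerator / sum(len(x) for x in multi)


def reliability_of_total_average(records):
    """Formulas (13), (14) and (11): reliability of the total scholastic average."""
    n_total = sum(len(x) for x in records)
    var_delta = error_variance(records)
    var_big_delta = len(records) * var_delta / n_total                 # (13)
    grand_mean = sum(sum(x) for x in records) / n_total
    var_means = (
        sum(sum(x) ** 2 / len(x) for x in records) / n_total - grand_mean ** 2
    )                                                                  # (14)
    return 1 - var_big_delta / var_means                               # (11)


sigma_delta_sq = error_variance(term_averages)
big_r = reliability_of_total_average(term_averages)
```

The first quantity reproduces the .077 shown in Figure I for the error variance; the second is the coefficient R defined by (11).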


FIGURE I

Student | Point-hour ratios (term-averages), for successive terms | n_i | SX_i | SX_i² | n_i·SX_i²/(n_i − 1) | (SX_i)²/(n_i − 1) | (SX_i)²/n_i
A | 1.9, 1.6, 2.0, 2.0 | 4  | 7.5  | 14.17 | 18.893 | 18.750 | 14.062
B | 2.4, 2.4, 3.2, 2.8 | 4  | 10.8 | 29.60 | 39.467 | 38.880 | 29.160
C | 1.7, 1.9, 1.9      | 3  | 5.5  | 10.11 | 15.165 | 15.125 | 10.083
D | 3.1, 3.6, 3.5, 3.4 | 4  | 13.6 | 46.38 | 61.840 | 61.653 | 46.240
E | 2.4                | *1 | *2.4 |       |        |        | *5.760
F | 2.1, 1.9           | 2  | 4.0  | 8.02  | 16.040 | 16.000 | 8.000
G | 1.9, 1.4, 2.4, 2.0 | 4  | 7.7  | 15.33 | 20.440 | 19.763 | 14.823
H | 3.1, 2.7, 3.3      | 3  | 9.1  | 27.79 | 41.685 | 41.405 | 27.603
I | 3.1, 3.5, 3.5, 3.6 | 4  | 13.7 | 47.07 | 62.760 | 62.563 | 46.923

Σn_i = 29;  Σ' n_i·SX_i²/(n_i − 1) = 276.290;  Σ' (SX_i)²/(n_i − 1) = 274.139

σ_δ² = (276.290 − 274.139) / Σ'n_i = .077

S represents a horizontal summation along one line, of the 4 or less point-hour ratios (or their
squares) on that line. Σ' is a vertical summation, not including cases where only one measurement
is available (marked with the asterisk, *). Σ is a vertical summation including all the entries in
the column.

APPLICATION TO THE RELIABILITY OF
JUDGMENTS

The foregoing analysis is not limited to
the particular problem of scholastic averages.
It has at least one other very important use,
in the validation of character and personality
tests against judgments. The practice
has usually been to have each of a number of
judges rate or rank all the subjects. With
the method of obtaining the reliability
coefficient outlined above, it will be possible
to use more judges, and to have each
judge rate only those subjects with whom he
is reasonably well acquainted. It will be
necessary, if rankings are used, to multiply
each ranking by the reciprocal of the
number of subjects ranked by the judge who
assigns it, or by some multiple of this
reciprocal (the same multiple for every judge),
in order not to overweight the rankings of
those who rank the larger numbers of subjects.
Each of the ratings of an individual,
or each of the weighted rankings, may be
treated as if it were a single term-average
in the preceding analysis.
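
The weighting of rankings described above can be sketched as follows. The function name, the dictionary layout, and the sample judges are hypothetical, introduced only for illustration.

```python
def weighted_rankings(rankings, multiple=1.0):
    """Scale each judge's ranks by multiple / (number of subjects that judge ranked).

    rankings maps a judge to a dict of {subject: rank}; the same multiple
    is applied for every judge, as the text requires.
    """
    weighted = {}
    for judge, ranks in rankings.items():
        factor = multiple / len(ranks)
        weighted[judge] = {subject: rank * factor for subject, rank in ranks.items()}
    return weighted


# Hypothetical judges: one ranks four subjects, the other ranks ten.
sample = {
    "judge_1": {"s1": 1, "s2": 2, "s3": 3, "s4": 4},
    "judge_2": {"s%d" % k: k for k in range(1, 11)},
}
scaled = weighted_rankings(sample)
```

After scaling, a rank in the middle of a short list and a rank in the middle of a long list take comparable values, so a judge who ranks many subjects does not dominate.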


REGRESSION OF AN ESTIMATED "TRUE" CRITERION
UPON ONE OR SEVERAL "INDEPENDENT"¹
VARIABLES


The usual method used in validating a battery
of tests is to compute the equation of
multiple regression of the criterion upon the
several tests. The criterion scores in this
equation should properly include only the
systematic causes of variation. The normal
equations in their final form include only
the standard deviation of the criterion
scores and the correlations between the criterion
and the several "independent" variables.
We must therefore derive estimates of
the standard deviation of the "true" scores
in the criterion variable (the true underlying
abilities), $\sigma_\infty$; and of the correlation
between a "true" criterion and an "independent"
variable, $r_{\infty 1}$.

1. An "independent" (in quotation marks) variable is any one whose observed values are used to predict the values of a
criterion or "dependent" variable. Mathematical independence is not implied.

Let $x_0$ be a criterion score,
$x_\infty$ the corresponding "true" criterion
score,
$\delta_0$ the error of measurement in an obtained
criterion score,
$x_1$ a score in an "independent" variable.

All these scores are to be taken as deviations
from their respective means for the
group. Then,

\[
x_0 = x_\infty + \delta_0 \tag{15}
\]

and according to our previous definition,

\[
R_0 = \frac{\sigma_\infty^2}{\sigma_0^2} \tag{16}
\]

so that

\[
\sigma_\infty = \sigma_0 \sqrt{R_0} \tag{17}
\]

This is the formula for the standard deviation
of the "true" scores. Continuing,

\[
r_{\infty 1} = \frac{\Sigma x_\infty x_1}{N \sigma_\infty \sigma_1}
\]

From (15),

\[
r_{\infty 1} = \frac{\Sigma x_1 (x_0 - \delta_0)}{N \sigma_\infty \sigma_1}
\]

Since $\delta_0$ is an error of measurement, we may
assume that it is uncorrelated with $x_1$, so
that $\Sigma x_1 \delta_0 = 0$, and

\[
r_{\infty 1} = \frac{\Sigma x_1 x_0}{N \sigma_\infty \sigma_1}
\]

Then from (17),

\[
r_{\infty 1} = \frac{\Sigma x_1 x_0}{N \sigma_0 \sigma_1 \sqrt{R_0}} = \frac{r_{01}}{\sqrt{R_0}} \tag{18}
\]

Similarly,

\[
r_{\infty 2} = \frac{r_{02}}{\sqrt{R_0}}, \text{ etc.} \tag{19}
\]

Equations (18), (19), and others like them
give the correlation between a "true" criterion
and an "independent" variable.
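
Formulas (17) and (18) amount to two one-line corrections, sketched here for illustration; the function names and the numerical values are hypothetical.

```python
from math import sqrt


def true_sd(observed_sd, reliability):
    """Formula (17): standard deviation of the "true" criterion scores."""
    return observed_sd * sqrt(reliability)


def corrected_r(observed_r, reliability):
    """Formula (18): correlation of a test with the "true" criterion."""
    return observed_r / sqrt(reliability)


# Hypothetical values: criterion reliability .81, observed validity .45.
sd_true = true_sd(10.0, 0.81)
r_true = corrected_r(0.45, 0.81)
```

Only the reliability of the criterion enters the correction; the "independent" variable is left as observed, since it is the fallible test itself that will be used for prediction.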

The normal equations for a problem involving
a criterion and n "independent" variables
may be written,

\[
\begin{aligned}
r_{\infty 1} - \beta_1 - r_{12}\beta_2 - \dots - r_{1n}\beta_n &= 0, \\
r_{\infty 2} - r_{12}\beta_1 - \beta_2 - \dots - r_{2n}\beta_n &= 0, \\
&\;\;\vdots \\
r_{\infty n} - r_{1n}\beta_1 - r_{2n}\beta_2 - \dots - \beta_n &= 0.
\end{aligned}
\tag{20}
\]

These equations may be solved for $\beta_1, \beta_2, \dots, \beta_n$
by any of the well-known methods, such as
the Doolittle method, the Kelley-Salisbury
iteration method, etc. From the solution we
obtain the partial regression coefficients
or sub-test weights according to the equations,

\[
b_p = \beta_p \frac{\sigma_\infty}{\sigma_p} \tag{21}
\]

The multiple correlation coefficient, whose
value (or the value of whose square; the argument
for the latter has been given by the
writer elsewhere¹) is the coefficient of
practical validity of the battery, may be
obtained from the relation,

\[
R^2_{\infty(123 \dots n)} = \beta_1 r_{\infty 1} + \beta_2 r_{\infty 2} + \dots + \beta_n r_{\infty n} \tag{22}
\]
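
The procedure of (18) through (22) can be sketched as follows. Plain Gaussian elimination is swapped in here for the Doolittle or Kelley-Salisbury methods named in the text; the function names and the sample correlations are hypothetical.

```python
from math import sqrt


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda i: abs(m[i][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for i in range(c + 1, n):
            factor = m[i][c] / m[c][c]
            for j in range(c, n + 1):
                m[i][j] -= factor * m[c][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def validity_against_true_criterion(r_tests, r_criterion, reliability):
    """Return (betas, R squared) for the regression of the "true" criterion on the tests."""
    r_true = [r / sqrt(reliability) for r in r_criterion]    # (18), (19)
    betas = solve(r_tests, r_true)                           # (20)
    r_squared = sum(b * r for b, r in zip(betas, r_true))    # (22)
    return betas, r_squared


# Hypothetical two-test battery: r_12 = .3, r_01 = .5, r_02 = .4, R_0 = .81.
tests = [[1.0, 0.3], [0.3, 1.0]]
betas_true, r2_true = validity_against_true_criterion(tests, [0.5, 0.4], 0.81)
betas_obs, r2_obs = validity_against_true_criterion(tests, [0.5, 0.4], 1.0)
```

Because every criterion correlation is divided by the same factor, the corrected squared multiple correlation equals the uncorrected one divided by the criterion reliability, and the relative weights of the tests are unchanged.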


It should be noted finally that the whole
analysis outlined in the present paper is
limited in its usefulness to fairly large
samples, that is, samples in which the difference
between N' and the largest $n_i$ is greater than
30 or 40.

1. The reliability of a test is equal to the square of its correlation with the ability underlying it. Analogously, we
might define its validity as the square of the correlation between the test scores and the ability underlying the
criterion scores. Op. cit., p. 29.



















THE DIFFERENTIAL PREDICTIVE VALUE OF THE PSYCHOLOGICAL EXAMINATION 
OF THE AMERICAN COUNCIL ON EDUCATION 
J. Virgil Waits 
Graduate Student 
Columbia University 


It is the purpose of this paper to dis- 
cuss the general predictive values of the 
Psychological Examination of the American 
Council on Education, a test of mental abil- 
ity, used at the Alabama Polytechnic Insti- 
tute. The main problem is to discover what 
possibilities there are that this test may 
reasonably be expected to differentiate be- 
tween the chances of success in the differ- 
ent divisions of the college. We shall also 
attempt to determine whether by reweighting 
the subtests of the examination we might not 
arrive at a better value to be used in guid- 
ing students than is provided by the total 
score. We shall note, incidentally, the re- 
liability of the present scholastic average. 


DESCRIPTION OF TEST 


The Psychological Examination of the Amer-
ican Council is used in some 250 colleges and
universities, being one of the most widely 
used tests of its kind. A new form of the 
test appears annually. The form for 1927 
was used in this study. The test is composed 
of five subtests, and has a total possible 
score of 355. A short description of the sub- 
tests is given below: 

The Completion Test, designed to measure verbal ability, consists of
40 items. A weight of 2 points is assigned for each correct response.
The directions for the test instruct the student to "Think of the most
appropriate word to complete each of the sentences. The number in each
space indicates the number of letters in the most appropriate word for
that space." The scoring is comparatively objective, alternate answers
being allowed, provided they conform to the number of letters in the
space. Two words cannot be used and no word is correct which does not
contain the specified number of letters.





The Artificial Language Test is a substitution test consisting of 20
sentences containing a total of 74 words. A weight of one point is
given for each correct word. In case an entire sentence is omitted all
subsequent answers are disregarded in the scoring. An artificial
vocabulary is given which substitutes certain nonsense syllables for
words. Rules for the formation of plurals, past and future tense of
verbs, verbal nouns, adjectives, and adverbs are given. Half of the
sentences listed call for translation of English sentences into the
artificial language and the other half for the reverse process. The
student is directed not to try to memorize the vocabulary or forms but
to consult them freely while translating. The scoring is objective.

The Analogies Test (space type) consists of 24 items, each of which
carries a weight of 2 points for a correct response. It is designed as
a test of ability to see relationships between various types of
drawings and figures. Each line in this test contains eight drawings or
figures, the first three of which are separated from the last five by a
space. The second drawing in each line is a modification of the first,
and the student must find in the last five drawings one which has the
same modifications of the third drawing, i.e., find one which bears the
same relation to the third as does the second to the first. The scoring
is objective.

The Arithmetic Test contains 20 problems 
of increasing difficulty. Four points are 
assigned for each correct response. The scor- 
ing is objective. 

The Opposites Test, another test of ver- 
bal ability, contains 27 items. Three points 
are given for each correct response. The 
directions read: "Each group of four words 
in the lines below contains two words that 


March, 1933 


are either (a) the same or nearly the same, 
or (b) the opposite or nearly the opposite 
in meaning. Find the two words in each group 
that are either the same or opposite, and 
write the numbers of these words in the column
at the right, headed 'Same' or the column
headed 'Opposite' as the case may be." No
partial credit is given; the scoring is ob- 
jective. 


THE CRITERION 


The criterion of success in college used in this study is the
scholastic weighted average of the student. In computing this average
the number of credit-hours for each course is multiplied by the grade
for that course and the total of all these products is divided by the
total number of credit-hours for that semester. The mean of the
semester averages was taken as the measure of college success. In
obtaining this mean the number of credit-hours was not used as a
weight, as the inclusion of such weighting would not materially affect
the results. Records were obtainable for all students who remained in
school one semester or more.
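
The computation of the criterion described above can be sketched as follows. This is only an illustration: the function names and the grades are hypothetical, and the sketch assumes each course is recorded as a pair of credit-hours and a per cent grade.

```python
def semester_average(courses):
    """Credit-hour weighted average for one semester.

    `courses` is a list of (credit_hours, grade_per_cent) pairs.
    """
    total_hours = sum(hours for hours, _ in courses)
    return sum(hours * grade for hours, grade in courses) / total_hours


def college_average(semesters):
    """Unweighted mean of the semester averages, as the text describes."""
    averages = [semester_average(courses) for courses in semesters]
    return sum(averages) / len(averages)


# Hypothetical record of one student over two semesters.
semesters = [
    [(3, 80.0), (4, 70.0), (2, 90.0)],  # 700 grade-points over 9 hours
    [(3, 75.0), (3, 85.0)],             # 480 grade-points over 6 hours
]
print(college_average(semesters))
```

Note that the second step deliberately ignores the number of credit-hours carried in each semester, in keeping with the author's statement that such weighting would not materially affect the results.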

The scholastic grades at the Alabama Polytechnic Institute are reported
in per cents. A grade of 60 per cent is required for passing and the
average grade of the college is slightly less than 75 per cent. There
are about 10 per cent of failures and the range in grades for a
semester is, approximately, 50 to 95 per cent. The average of the whole
group included in this study is 72.98 per cent.


COMPOSITION OF GROUPS STUDIED 


In order to check for the differential predictive value of a test, it
is necessary to have two or more groups that may be expected to vary in
the elements necessary for success. In grouping the students for this
study an effort was made to bring together in each group those
departments of the college whose curricula, as shown by an analysis,
were similar. The writer is suffering from no illusion that the
grouping arrangement of this study gives groups in which the required
abilities are identical. It would appear, however, that basically the
college is divided into three groups. Two of these are apparent at a
glance, agriculture and engineering. The third group is more varied.
This group includes the general department, the Education department,
and the department of business administration. In all of these
departments verbal ability would seem to be of prime importance. We
have therefore placed these three departments into one group.

Table I shows the composition of the three groups, the persistence in
each school of the students in each group, and the departments of the
college in each group. Throughout this study the three groups will be
termed: agricultural, engineering and general.

TABLE I
SHOWING THE COMPOSITION OF THE THREE GROUPS AND THE PERSISTENCE OF
STUDENTS IN EACH GROUP IN COLLEGE

AGRICULTURAL GROUP
    Agriculture
    Agricultural Education
    Home Economics
    Veterinary
    Total

ENGINEERING GROUP
    Architecture
    Chemical Engineering
    Civil Engineering
    Electrical Engineering
    Mechanical Engineering
    Textile Engineering
    Pharmacy and Pre-Medical
    Total

GENERAL GROUP
    Business Administration
    Education
    General

13 1 5 14. 67 
4 3 2 3 2 
3 oF ll 6 

28 15 


Total 20 414 





Total All 


Groups’ 401 


172 11 34 14 40 19 69 42 











*Number of students of Agriculture who remained 
in school 8 semesters, etc. 










RELIABILITY OF THE CRITERION 


There are a number of factors which act to 
prevent the scholastic average from being 
highly reliable. Since these factors would 
tend to lower the correlation between schol- 
astic success, as measured by per cent aver- 
age, and any variable which we might wish to 
correlate with such averages, we must compute
the reliability of the criterion and correct
for attenuation.

The coefficient of correlation between two forms of the same measure is
ordinarily taken as the reliability coefficient. A number of
difficulties are inherent in this method when we attempt to apply it to
the present study. Among these difficulties are: (1) Any method of
dividing the semester averages into two halves must be arbitrary, and
an entirely satisfactory method of division is not apparent; (2) in the
case of the student for whom an odd number of semester averages was
available, the number of averages in the two groups would be unequal;
and (3) where the student remained only one semester, it would have
been impossible to compute the reliability of averages.

The technique for computing the reliabil- 
ity of the scholastic average has been pre- 
sented by Cureton. The validity of the 
mathematics of the technique will not be 
discussed here since it is outside the scope 
of this paper. Only a short description is 
given and those desiring a full discussion 
should refer to the original paper. 

The estimate of the variance for each individual is expressed by the
formula,

$$\sigma_i^2 = \frac{SX_i^2}{n_i - 1} - \frac{(SX_i)^2}{n_i(n_i - 1)} \qquad (1)$$

in which $\sigma_i^2$ is the variance of the individual, $S$ is the
summation from 1 to $n_i$, $n_i$ is the number of averages for each
student, $SX_i^2$ is the sum of the squares of the averages, and
$(SX_i)^2$ is the square of the sum of the measures for an individual.
The value of $\sigma_s^2$ for the group is found from the formula,

$$\sigma_s^2 = \frac{\Sigma (n_i \sigma_i^2)}{\Sigma n_i} \qquad (2)$$





in which $\sigma_s^2$ is the variance for the group of any single
semester average. The other terms in this equation are defined in
connection with formula (1).

In our case, however, what is necessary is the variance of the total
scholastic averages of N individuals, each of whom has n semester
averages. Representing this measure by $\sigma_e^2$ we have the formula,

$$\sigma_e^2 = \frac{N \sigma_s^2}{\Sigma n_i} \qquad (3)$$

This formula gives us the variance of the group about group mean. It
might be noted that in this formula we are able to use all of the cases
for which records are available. It was necessary to establish the
variance of the individual means about the group mean. Representing
this by $\sigma_M^2$, we have

$$\sigma_M^2 = \frac{\Sigma \left[ (SX_i)^2 / n_i \right]}{\Sigma n_i} - \left( \frac{\Sigma SX_i}{\Sigma n_i} \right)^2 \qquad (4)$$

These derived values furnish us with the data for the calculation of
the coefficient of reliability, R, from the following formula,

$$R = 1 - \sigma_e^2 / \sigma_M^2 \qquad (5)$$

The reliability coefficients of the semester averages for the three
groups are:

Agricultural Group ...... 0.938
Engineering Group ....... 0.915
General Group ........... 0.859
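
The chain of computations leading to R can be sketched in code. The sketch rests on several assumptions that should be checked against Cureton's original paper: the formulas are taken as read from a damaged copy of the text; the pooled single-semester variance is weighted by the number of averages per student; and students with only one semester average, for whom formula (1) is undefined, are left out of the pooling step but kept in every other step. The data are hypothetical.

```python
def individual_variance(averages):
    """Formula (1): unbiased variance of one student's semester averages."""
    n = len(averages)
    total = sum(averages)
    total_of_squares = sum(x * x for x in averages)
    return total_of_squares / (n - 1) - total * total / (n * (n - 1))


def criterion_reliability(records):
    """Reliability of the scholastic average, following formulas (2)-(5).

    `records` holds one list of semester averages per student.
    Assumption: students with a single average are skipped when pooling
    the within-student variance, since formula (1) needs n >= 2.
    """
    multi = [r for r in records if len(r) >= 2]
    # Formula (2): pooled variance of a single semester average.
    pooled = sum(len(r) * individual_variance(r) for r in multi) / sum(
        len(r) for r in multi
    )
    n_students = len(records)
    n_averages = sum(len(r) for r in records)
    # Formula (3): error variance of the students' total averages.
    error_variance = n_students * pooled / n_averages
    # Formula (4): variance of the individual means about the group mean.
    mean_variance = (
        sum(sum(r) ** 2 / len(r) for r in records) / n_averages
        - (sum(sum(r) for r in records) / n_averages) ** 2
    )
    # Formula (5).
    return 1 - error_variance / mean_variance


# Hypothetical records for four students.
records = [[70.0, 80.0], [60.0, 62.0, 64.0], [90.0, 88.0], [75.0]]
reliability = criterion_reliability(records)
print(round(reliability, 3))
```

The design point worth noting is that no splitting of the semester averages into halves is required, which is what allows students with an odd number of semesters, or with a single semester, to remain in the computation.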


Using these values we corrected for attenuation in the criterion by the
formula,

$$r_{\infty i} = \frac{r_{0i}}{\sqrt{R}} \qquad (6)$$

in which $r_{\infty i}$ is the corrected value of the correlation
between the criterion and the subtest score under consideration, and
$r_{0i}$ is the corresponding uncorrected correlation. In Tables IIA,
IIIA, and IVA, the first column gives the uncorrected correlation and
the second column the corrected correlations.
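
As a check on the correction, the following sketch divides an uncorrected correlation by the square root of the criterion reliability and compares the result with entries from Tables IIA, IIIA, and IVA. The function name is hypothetical; the numerical values are those reported in the paper.

```python
import math


def correct_for_criterion_attenuation(r_uncorrected, criterion_reliability):
    """Divide the observed correlation by the square root of R."""
    return r_uncorrected / math.sqrt(criterion_reliability)


# (uncorrected r, reliability of the group's criterion, corrected r as tabled)
reported = [
    (0.3903, 0.938, 0.4029),  # Agricultural Group, Completion
    (0.4307, 0.915, 0.4502),  # Engineering Group, Arithmetic
    (0.5129, 0.859, 0.5534),  # General Group, Analogies
]
for r, reliability, tabled in reported:
    corrected = correct_for_criterion_attenuation(r, reliability)
    print(f"{r:.4f} -> {corrected:.4f} (tabled {tabled:.4f})")
```

The computed values agree with the tabled ones to within rounding in the fourth decimal place.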





1. Edward E. Cureton, "Validation Against a Fallible Criterion," (This Journal: Preceding Article).


CALCULATION OF SUBTEST WEIGHTS

The means, standard deviations, and intercorrelations for the subtests
and scholastic average for each group are given in Tables II, III, and
IV, and IIA, IIIA, and IVA respectively. The correlations between the
total psychological score and the criterion are also given.





TABLE II 


AGRICULTURAL GROUP. MEANS AND STANDARD DEVIATIONS OF THE SUBTESTS 
AND SCHOLASTIC AVERAGE. (N = 86) 





Scholastic| Correlation Artificial 
Average Scholastic | Completion Language | Analogies | Arithmetic | Opposites 


Average 





Mean 73.14 
Standard 
Deviation 8.01 





























TABLE IIA
AGRICULTURAL GROUP. INTERCORRELATIONS OF THE SUBTESTS AND SCHOLASTIC AVERAGE
(N = 86)





Scholastic| Correlation Artificial 
Average | Scholastic | Completion | Language Analogies | Arithmetic 
Average 





Completion            .3903   .4029
Artificial Language   .3858   .3983
Analogies             .4061
Arithmetic            .3341   .3449
Opposites             .2798





Total 

Psycholog- 
ical 
Score 























TABLE III 


ENGINEERING GROUP. MEANS AND STANDARD DEVIATIONS OF THE SUBTESTS AND 
SCHOLASTIC AVERAGE. (N = 212) 





Scholastic | Correlation | Artificial 
Average | Scholastic | Completion; Language | Analogies | Arithmetic | Opposites 


Average 





Mean 74.21 16.06 14.96 


Standard 
Deviation 9.83 13.18 9.75 





































TABLE IIIA

ENGINEERING GROUP. INTERCORRELATIONS OF THE SUBTESTS AND SCHOLASTIC AVERAGE
(N = 212)

                             Scholastic    Corrected
                             Average       Scholastic Average
Completion                   .3286         .3435
Artificial Language          .2599         .2717
Analogies                    .3726         .3895
Arithmetic                   .4307         .4502
Opposites                    .3333         .3484
Total Psychological Score    .4484         .4687

[The columns of intercorrelations among the subtests (Completion, Artificial Language, Analogies, Arithmetic) are illegible in the source.]




















TABLE IV 


GENERAL GROUP. MEANS AND STANDARD DEVIATIONS OF THE SUBTESTS AND SCHOLASTIC AVERAGE.
(N = 103)





| Scholastic | Correlation | Artificial 
Language | Analogies | Arithmetic | Opposite 


Average | Scholastic | Completion 
Average | 


a 





Mean 70.29 | | 17.69 
Standard 
Deviation 





19.30 
8.48 























TABLE IVA

GENERAL GROUP. INTERCORRELATIONS OF THE SUBTESTS AND SCHOLASTIC AVERAGE.
(N = 103)

                             Scholastic    Corrected
                             Average       Scholastic Average
Completion                   .3424         .3694
Artificial Language          .3105         .3350
Analogies                    .5129         .5534
Arithmetic                   .3348         .3612
Opposites                    .3466         .3740
Total Psychological Score

[The entries for the Total Psychological Score and the columns of intercorrelations among the subtests are illegible in the source.]























Taken as a whole, the correlations of the subtests with the criterion
were rather low, ranging from .27 to .55. This was not unexpected since
the total time for the whole examination is about one hour. Thus each
subtest receives an average of about twelve minutes. In general the
correlations of the subtests with the criterion were slightly higher
than the intercorrelations of the subtests with each other, and lower
than the correlation between the criterion and the total psychological
score. The correlation of the total psychological score with the
criterion was, approximately, .47 for the Engineering Group and .53 for
the Agricultural and General Groups. The chief use of these
correlations is for further computations.

The partial regression weights of the various subtests were obtained
from a set of normal equations, set up in accordance with the formulae
given by Kelley.¹ These partial regression weights or path
coefficients² are listed in Table V. To obtain them the normal
regression equations were solved by a modified form of the Doolittle
method³ presented by Peters and Wykes.⁴ A discussion of the method is
given by Mills.⁵

TABLE V
STANDARD REGRESSION COEFFICIENTS. (PATH COEFFICIENTS OR EFFECTIVE WEIGHTS)

Subtest       Agricultural   Engineering   General
Arithmetic       .1395          .2896       .0553
Opposites       -.0078          .1082       .0719

[The rows for Completion, Artificial Language, and Analogies are illegible in the source.]

These weights are the effective weights of each subtest. They give the
relative importance of each subtest in predicting the criterion.

In the Agricultural Group we find the weights of the Completion and
Analogies Tests corresponding closely, as do the weights of the
Artificial Language and Arithmetic Tests. The Opposites Test with an
effective weight of -.0078 is difficult to explain. In all three of the
groups the highest intercorrelation found between any two subtests was
between Completion and Opposites. This was expected since both are
tests of verbal ability. As may be noted, the effective weights of the
Completion and Opposites Tests in the Engineering and General Groups
are essentially the same. Since the effective weight of the Completion
Test for the Agricultural Group is .2515, the best explanation of this
-.0078 for the Opposites Test would seem to be that it is due to
sampling errors of a small group (86).

In the Engineering Group the weights vary somewhat. The weights for the
Completion, Analogies, and Opposites Tests are approximately the same,
while that for Artificial Language is negligible. One notable feature
of this group is the high weight of the Arithmetic Test, the effective
weight for this test being more than twice that for any other subtest.
This was rather expected. The curriculum of the engineering department
contains a large amount of mathematics for which arithmetic forms a
foundation.

In the General Group we find a very even weighting for four of the
tests and an extremely high weighting for the fifth test. The Analogies
Test has an effective weighting of .4208, which is practically five
times as high as the weight of the next highest test. This seems to be
somewhat out of line. It would appear that, for this group, the
Completion and Opposites Tests should have a high weighting. However,
the correlation between the Analogies Test and the criterion is
considerably higher than that found for any other subtest. In fact, it
is higher than the correlation between the scholastic average and the
total psychological score for this group. There is a possibility that
we have underestimated the importance of the Analogies Test.
Nevertheless, granting that the Analogies Test is more important for
this group than we would naturally suppose, it is hardly conceivable
that it has the importance which this weighting would seem to indicate.

GENERAL PREDICTIVE VALUE OF THE BATTERY

The determination of the general predictive value of the test involves
the multiple correlation coefficients for the three groups. These
multiple correlation coefficients between the criterion and the
subtests for the
































1. Truman Lee Kelley, Statistical Method, (New York: The Macmillan Company, 1924), pp. 295-502.

2. S. Wright, "Correlation and Causation," Journal of Agricultural Research (1921), pp. 557-585.

3. Method developed by M. H. Doolittle of the U. S. Coast and Geodetic Survey about 1875.

4. C. C. Peters and E. C. Wykes, "Simplified Methods for Computing Regression Coefficients," Journal of Educational Research, 25, pp. 585-595.

5. Frederick C. Mills, Statistical Methods, (New York: Henry Holt and Company, 1924), Appendix A, pp. 276-261.
















three groups are given in Table VI, together with the zero-order
coefficients of correlation between the scholastic average and the
total psychological score. The results indicate that the weighted sum
of the scores on the subtests is, in this case, only slightly better in
predicting scholastic success than is the total score of the
psychological examination. The correlation between the total score and
the scholastic average is about .50 for the three groups, and the
multiple correlation coefficient is about .04 greater. Since the same
population is involved this difference represents an actual gain of
that amount in predictive value. This increase in accuracy may not,
however, compensate for the amount of calculation necessary to obtain
the multiple correlation coefficients.


TABLE VI

COMPARISON OF THE MULTIPLE CORRELATION COEFFICIENTS AND THE ZERO-ORDER
COEFFICIENTS OF CORRELATION BETWEEN THE CRITERION AND THE TOTAL
PSYCHOLOGICAL SCORE

                Multiple Correlation   Zero-order
Group           Coefficient            Coefficient
Agricultural       .5510                 .5275
Engineering        .4978                 .4687
General            .5769                 .5307








DIFFERENTIAL PREDICTIVE VALUES 


The chief purpose of this study is to determine whether the weighting
of the various subtests is sufficiently different for the three groups
to warrant differential prediction of success in the groups. To do this
it was necessary to obtain:

(a) The regression coefficients for each group, which are shown in
Table VII.

(b) The standard errors of those subtests for which comparisons were to
be made.

(c) The standard error of the difference between any two corresponding
coefficients of the regression equations.

(d) Significance ratio of the actual difference between the
corresponding subtest weights to the standard error of the difference.
Using this ratio we were able to consult Sheppard's table of the
probability integral and find what probability there is that the
differences noted are significant. The values obtained are listed in
Table VIII.








TABLE VII
REGRESSION COEFFICIENTS OR ACTUAL SUBTEST WEIGHTS

Subtest               Agricultural   Engineering   General
Completion               .2367          .0873       .0386
Artificial Language      .1952          .0126       .0579
Analogies                .1576          .0881       .2021
Arithmetic               .0584          .2288       .0304














In some cases there is a very small difference between the pairs of
regression weights for the different groups, so small that we may be
sure that the differences would not be significantly greater than their
standard error. Such cases will possess small differential predictive
values. With this in mind, only those cases which showed fairly large
differences were investigated.


TABLE VIII

THE SIGNIFICANCE OF THE DIFFERENCES OF THE PARTIAL REGRESSION WEIGHTS
FOR VARIOUS PAIRS OF GROUPS, AND THE CHANCES IN A THOUSAND THAT THE
DIFFERENCES WOULD OCCUR BY CHANCE

                                                   Number of times in a
Subtest Weights       Groups      Significance     thousand the difference
Compared              Compared    Ratio            would occur by chance
Arithmetic            E&G         2.414             16
Arithmetic            E&A         1.23             218
Completion            A&G         1.48             138
Artificial Language   E&A         1.18             238
Artificial Language   A&G         0.871            384
Artificial Language   G&E         0.462            645














An examination of Table VIII shows: (a) that there are 16 chances in a
thousand that the difference in the weighting of the Arithmetic Test
for the Engineering and General Groups is due to chance. This means
that the Arithmetic Test has substantial predictive value in
differentiating between the Engineering and General Groups. (b) In the
Arithmetic Test for the Engineering and Agricultural Groups there are
at least two chances in ten that the difference may be due to chance,
hence the difference is only indicative. (c) In the Completion Test for
the Agricultural and General Groups there is one chance in seven that
the difference is due to chance, and the difference is again only
indicative. (d) The Artificial Language Test shows no differential
predictive value in differentiating between any two groups.

The investigation of the Artificial Language Test demonstrates clearly
the truth of the assumption that the smaller b-regression coefficients
needed no investigation.
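
The step from a significance ratio to "chances in a thousand" can be sketched as below. The sketch assumes that the tabled chances are two-tailed areas of the normal curve, which reproduces the published figures to within about one unit, and that the standard error of a difference is the root of the sum of the squared standard errors (an assumption appropriate to independent groups; the paper does not print its formula). Function names are hypothetical, and the standard library's `math.erfc` stands in for Sheppard's table.

```python
import math


def significance_ratio(weight_a, se_a, weight_b, se_b):
    """Difference between two weights divided by its standard error.

    Assumes the two groups are independent, so the variances add.
    """
    return abs(weight_a - weight_b) / math.sqrt(se_a ** 2 + se_b ** 2)


def chances_in_a_thousand(ratio):
    """Two-tailed normal probability of a ratio this large, per thousand."""
    return 1000 * math.erfc(ratio / math.sqrt(2))


# Significance ratios from Table VIII with the chances reported there.
table_viii = [(2.414, 16), (1.23, 218), (1.48, 138), (1.18, 238), (0.871, 384), (0.462, 645)]
for ratio, reported in table_viii:
    print(f"ratio {ratio}: computed {chances_in_a_thousand(ratio):.0f}, reported {reported}")
```

Only the first comparison (Arithmetic, Engineering against General) falls clearly below the conventional levels of chance, which is the basis for the author's reading of the table.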


CONCLUSIONS

1. The reliability of the scholastic averages for the various groups is
high, averaging .904.

2. A reweighting of the subtests increased the predictive value of the
examination. It raised the correlation between the scholastic average
and the total psychological score about .04.

3. There is not sufficient difference in the weights of a given subtest
for the different groups, Agricultural, Engineering, and General, to
warrant differential prediction of scholastic success in college on the
basis of the Psychological Examination of the American Council on
Education. Enough difference was found, however, to indicate that a
test might be constructed which would have differential predictive
value for various college groups.










JOURNAL OF EXPERIMENTAL EDUCATION 


Volume I, No, 3 


ON THE LIMITS OF PREDICTING SCHOLASTIC SUCCESS 
Howard Easley 
Duke University 
Durham, North Carolina 


It is a matter of common knowledge that correlations between
intelligence test scores and scholastic achievement are usually not
sufficiently high to be of great prognostic value except in cases of
extreme variation. It is probably safe to say that the majority of such
correlations lie between .40 and .60 with the mean not far from .50.
Prognosis on the basis of a measure correlating to this extent (.50)
with school success will eliminate only 13.4 per cent of the error
incident to guessing and will establish only relatively broad limits of
probable success for students with a given intellectual ability.
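
The figure of 13.4 per cent follows from the coefficient of alienation, k = sqrt(1 - r^2), with predictive efficiency taken as 1 - k. A minimal sketch, with hypothetical function names:

```python
import math


def alienation(r):
    """Coefficient of alienation: sqrt(1 - r^2)."""
    return math.sqrt(1 - r * r)


def predictive_efficiency(r):
    """Proportion of the error of guessing that is eliminated: 1 - k."""
    return 1 - alienation(r)


for r in (0.40, 0.50, 0.60, 0.80):
    print(f"r = {r:.2f}: efficiency = {100 * predictive_efficiency(r):.1f} per cent")
```

For r = .50 this gives 13.4 per cent, and for r = .80 it gives 40 per cent, so efficiency on this measure rises slowly until the correlation is quite high.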

A number of factors may theoretically account for this apparently low
relationship, the most obvious of which are: (1) Intelligence and
scholastic ability may not be more closely related than the
correlations indicate. As is commonly noted by teachers, other factors
than ability are large determiners of scholastic success. Probably the
most commonly noted factors are industry, adjustment, background of
previous training, and outside activities. (2) The reliabilities of the
measures used may be low. And (3), the validity of one or both of the
measures used may be low. This factor may or may not be related to the
second.





The relative significance of these factors is of considerable
importance in determining the course of future efforts at prognosis. If
factors (2) and (3) are the determining ones, then we need only to
improve the types of measures which are now being used. If, on the
other hand, (1) is the determining factor we shall have to devise other
measures, or at least make greater use of other types of measures than
intelligence test scores and school marks.

Two recent writers have dealt with certain aspects of factors (2) and
(3). Hull¹ has apparently taken the position that the trouble lies
largely with our tests when he says: "The present indications are that
unless some more or less radical improvement in test construction is
discovered, psychological tests will be forever doomed to operate at an
efficiency under 50 per cent, probably under 40 per cent, and very
possibly the average efficiency will not rise much above 25 per cent or
30 per cent."²

Huffaker³, on the other hand, insists that our tests are fairly
adequate, but that the lack of correlation is due to the unreliability
of school marks. He says: "It may be concluded that before there is any
marked improvement in the predictive value of tests, there must be an
increase in the reliability




1. Clark L. Hull, "The Correlation Coefficient and Its Prognostic Significance." Journal of Educational Research, XV (1927), pp. 222-258.

2. The predictive efficiency of tests is measured here by the coefficient of alienation. This coefficient, although accurate for determining the amount of error in predicting for the whole range of abilities, is somewhat misleading. Suppose, for example, that $r_{12}$ = .80, then $k_{12}$ = .60, and the predictive efficiency is 1 - $k_{12}$ or .40. Now if we take those cases which are two σ below the mean of 1, their scores on 2 will average 1.6σ below the mean of 2, and the standard deviation of these scores around their mean will be .6σ of the whole distribution of scores. We can be practically certain that any individual whose score in 1 is 2σ below the mean will not make a score in 2 higher than 3 x .6σ - 1.6σ, or .2σ above the mean of the whole distribution. For this group the range of scores on 2 is limited not by 40 per cent but by 46.7 per cent. Furthermore, the number of this group falling above the mean of 2 is 1.0 per cent rather than 50 per cent as we should expect if no predicting instrument were used. Suppose our problem is college entrance and our criterion is probability of graduation. If the grade requirement for graduation is C, the average grade, then the predictive efficiency of a test correlating with grades to the extent of .80 would be 99 per cent for those individuals 2σ below the average on the test. This is a rather extreme case, but a real one of the type which should be kept in mind when interpreting correlations in terms of the alienation coefficient. (The statistical processes above, as well as certain others used later, are explained in a note at the end of this article.)

3. C. L. Huffaker, "Predictive Significance of the Correlation Coefficient." Journal of Educational Research, XXI (1930), pp. 46-45.







of the measures of scholastic success. The present need is not a
'radical improvement in test construction' but a radical improvement in
the measures of scholastic success. If the criteria of scholastic
success were as reliable as the present psychological tests, it is
probable that prediction would be satisfactory in three cases out of
four." He arrives at this conclusion on the basis of the assumption
that teachers' marks rarely have a reliability above .75. In this case
the correlation with test scores could not possibly be, except by pure
chance, above .866, and the predictive efficiency could not be above
.50.

The problem of the present study is to 
evaluate some of the factors mentioned above 
and to determine the limits of predicting 
school marks such as are now used. We can- 
not determine the true relationship between 
intelligence and scholastic ability without 
knowing the validity and reliability of the 
intelligence tests and school marks used.
The reliabilities of these measures can be 
determined, but we have no entirely objec- 
tive measure of their validity. If we as- 
sume that the tests measure intelligence and 
school marks measure scholastic success, to 
the extent that these two measure anything, 
our question becomes particularly: Is the 
lack of correlation between school marks and 
intelligence tests due to a real lack of re- 
lationship between intelligence and school 
success, or to the unreliability of the meas- 
ures used? 

The records of 312 men entering the fresh- 
man class of Duke University in the fall of 
1930 form the basis of the study. A total 
of 411 men took the intelligence test, but 
of these, 99 failed to register, dropped out 
before the end of the first semester, or had 
one or more grades reported as incomplete at 
the end of the semester. All entering students
took the test except a very small number,
these, for the most part, being late
registrants. The subjects used, then, include
all those who took the test upon entrance,
and for whom grade reports were complete
on all five subjects at the end of the
first semester. 

The intelligence test used was the 1930 edition of the Psychological
Examination, prepared by Thurstone for the American Council on
Education. The grades in the undergraduate school at Duke are A, B, C,
D, Inc., and F. The first four are passing grades. No numerical
equivalents are recognized, except a system of quality points in which
a grade of A carries three points per semester hour, B two, C one, D
none, and F minus one. The writer determined from the distribution of
grades reported the first semester a system of numerical equivalents as
follows: A = 4.8; B = 3.8; C = 3.0; D = 2.3; and F = 1.5. These
numerical equivalents were used in determining the reliability
coefficients for the first semester's grades, since only a small number
of grades for each student were involved, thus making the possible
number of categories small if quality points were used. In the later
studies involving a whole semester's grades or more the quality points
were used. There is no obvious reason to believe that the use of these
two different kinds of measures affected the results materially.

The correlation between intelligence test scores and grade averages for
the first semester was .510. This indicates a predictive efficiency of
14 per cent, as measured by the alienation coefficient. The reliability
of the Psychological Examination, as determined by Thurstone, is .95.
Its low predictive efficiency cannot be due in any large measure to its
lack of reliability. If its reliability were increased to unity the
maximum correlation with these grade averages would be
$.510/\sqrt{.95}$, or .523. But what of the unreliability of school
marks?
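
The two attenuation figures used in this argument can be verified with a short sketch. It assumes the usual attenuation relations: an observed correlation cannot exceed the square root of the reliability of either measure, and removing the unreliability of one measure divides the observed correlation by the square root of that measure's reliability. The function names are hypothetical.

```python
import math


def ceiling_correlation(reliability):
    """Largest correlation a measure of this reliability can show."""
    return math.sqrt(reliability)


def correlation_if_perfectly_reliable(r_observed, reliability):
    """Observed correlation corrected for unreliability in one measure."""
    return r_observed / math.sqrt(reliability)


# Huffaker's assumption: teachers' marks with reliability .75.
print(round(ceiling_correlation(0.75), 3))
# Observed r = .510 with a test whose reliability is .95.
print(round(correlation_if_perfectly_reliable(0.510, 0.95), 3))
```

The first value is the .866 ceiling cited from Huffaker, and the second is the .523 obtained for a perfectly reliable test, a gain of little more than one point in the second decimal place.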

The smallest unit of measurement of scho-
lastic success is the semester grade. An es-
timate of the reliability of such measures
may be obtained by correlating the grades re- 
ceived in a particular subject with the grades 
received in a similar subject the following 
semester. This method, however, introduces 
whatever new variables are incident to the 
change from one semester to another. Conse-
quently it seems that a more satisfactory es-
timate of the reliability would be obtained 
by correlating one part of a semester's work 
with another part, and stepping up the coef- 
ficient thus obtained by means of the Spear- 
man-Brown formula. The grades of a semester 
were divided by the following procedure: 

All of the grades reported were recorded 
in a given order. A given grade for one
student did not, however, represent the same
subject as the same grade for another stu- 
dent, since a wide variety of subjects were 
studied. The two following cases, selected 
at random, will serve to illustrate. The five
subjects for one student, in the order re-
corded, were: English, Mathematics, Bible,
Economics, and German. The five subjects for
another student, in the order recorded, were:
English, Bible, History, Physics, and French.
Thus correlations between various grade com- 
binations are not merely correlations be- 
tween particular combinations of subjects. 

For five grades there are 10 possible com-
binations of two-grade averages. These com-
binations yielded 15 intercorrelations, with-
out using the same grade twice in the same
correlation, with an average of .596. This
correlation may be regarded as the lower lim-
it of the reliability of a two-grade average
for one semester. The theoretical standard
error of this average correlation is .010.
The standard error of the average of the em-
pirical distribution of the 15 correlations
was .013. In view of the comparability of the
measures used, this low empirical standard
error justifies the use of the Spearman-Brown
formula for determining the reliability of the
semester average, which is found to be .785.
If our instrument of prediction were perfect
it should correlate perfectly with a perfect
measure of scholastic success, and it should
correlate to the extent of $\sqrt{.785}$, or .886,
with school marks, even as unreliable as they
are. Such a correlation would indicate a
predictive efficiency of 53 per cent instead
of the 14 per cent which we actually found.
The difference of 39 per cent must be due al-
most entirely to a lack of relationship be-
tween the two variables and not to the un-
reliability of our measures.
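The counting and the step-up described in this paragraph can be sketched as follows. This is an illustration under stated assumptions, not the author's worksheet: the helper name `spearman_brown` is hypothetical, and the five-grade semester average is treated as 2.5 times the length of a two-grade average.

```python
from itertools import combinations

def spearman_brown(r: float, n: float) -> float:
    """Reliability of a measure lengthened n-fold, given its reliability r."""
    return n * r / (1.0 + (n - 1.0) * r)

# Ten two-grade averages can be formed from five grades.
pairs = list(combinations(range(5), 2))

# Pairings of two averages that share no grade: these give the 15 intercorrelations.
disjoint_pairings = [(a, b) for a, b in combinations(pairs, 2) if not set(a) & set(b)]

r_two_grades = 0.596
r_semester = spearman_brown(r_two_grades, 5 / 2)   # about .787
limit = r_semester ** 0.5                           # about .887

print(len(pairs), len(disjoint_pairings))
print(f"{r_semester:.3f} {limit:.3f}")
```

Under these assumptions the stepped-up value comes out near .787, close to the .785 reported; the small difference presumably reflects rounding or a detail of the original computation that the text does not state.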

At the end of the academic year 1931-32
the records for 198 of the students were com-
plete for two years' work. The other 114
had either dropped out of school or had one 
or more grades reported as incomplete. With 
these 198 students the correlation between 
the number of quality points earned in the 
first and second years was .784. Since these 
subjects represented a restricted selection
of only those who survived two years of col-
lege work the range of quality points was
narrower than that of an unselected group of
students, and consequently the correlation
is smaller than the true correlation between
two years' work. This error may be correct-
ed by the formula¹

$$\frac{\sigma}{\Sigma} = \frac{\sqrt{1-R}}{\sqrt{1-r}}$$

in which $\sigma$ is the standard deviation of the
selected distribution, $\Sigma$ is the standard
deviation of the unselected distribution, and
$r$ and $R$ are the corresponding correlations.
Table I gives the reliabilities of one-, two-,
and four-year averages, and the limits of
possible correlation between these averages
and any predicting instrument. The figures
for the two- and four-year averages were
derived by the Spearman-Brown formula.
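Reading the correction formula as σ/Σ = √(1 − R)/√(1 − r), the corrected coefficient R can be obtained by solving for it. The sketch below assumes that reading; the function name is hypothetical, and the two standard deviations in the example are invented for illustration, since the article does not report them.

```python
def correct_for_selection(r: float, sd_selected: float, sd_unselected: float) -> float:
    """Solve sigma/Sigma = sqrt(1 - R) / sqrt(1 - r) for R."""
    ratio = sd_selected / sd_unselected
    return 1.0 - (1.0 - r) * ratio ** 2

# Hypothetical standard deviations (NOT taken from the article).
corrected = correct_for_selection(r=0.784, sd_selected=9.0, sd_unselected=10.0)
print(f"{corrected:.3f}")

# With no restriction of range the coefficient is unchanged.
unchanged = correct_for_selection(r=0.784, sd_selected=10.0, sd_unselected=10.0)
```

The narrower the selected group relative to the unselected one, the larger the upward correction.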


TABLE I

THE RELIABILITIES OF ONE-, TWO-, AND FOUR-YEAR AVERAGES

[Table body illegible in this copy. Rows: 1 year, 2 years, 4 years.]














With an unselected group of students, or 
rather a group selected as they are for col- 
lege entrance at Duke, it would, theoretical- 
ly, be possible to devise some measure which 
would correlate .972 with four-year averages, 
.946 with two-year averages, and .905 with
one-year averages. The predictive efficien-
cy of such a measure with the maximum corre-
lation with grades would be 75, 68, and 58
per cent respectively for the four-, two-,
and one-year averages.

It will be noted that the reliability of 
the one-year averages here is the same as 
the reliability of one-semester averages re- 
ported above (.784 and .785). This is not 
what we should ordinarily expect since the 
one-year averages include a larger number of 
measures than the one-semester averages. But 
the correlation between the first- and sec- 
ond-year averages represents the relation be- 
tween different measures, as the students 
took different courses in the two years, even 





1. T. L. Kelley, Statistical Method (New York: The Macmillan Company, 1925), p. 222.


if in the same departments. This coefficient 
is comparable to that between two forms of a 
test, rather than two administrations of the 
same form. The reliabilities in Table I may
be regarded as the lower limits of reliabil-
ity rather than the true reliabilities.¹ This
makes the high limits of possible correla-
tion all the more significant.

It seems then that the relatively low cor-
relations between test scores and school
marks are due very largely to a fundamental
lack of relationship between the functions
measured, rather than to the unreliability of
the measures used. It may not be concluded,
of course, that intelligence and scholastic 
ability are so unrelated. It may be that 
school marks, although they may be quite re- 
liable, are very imperfect measures of schol- 
astic ability; or that the intelligence tests 
do not measure intelligence; or both. 

It is interesting to note that while the 
correlation between test scores and first- 
semester averages was .510, the correlation 
between test scores and two-year averages 
was only .491. This latter coefficient was
tested by the appropriate correction formula²
and was found to be unaffected by the selec- 
tion due to students dropping out. The vari- 
ability of the test scores was smaller than 
that of the unselected group, but the vari- 
ability of the grades was actually larger. 
While the reliability of the averages in- 
creases with longer periods the correlation 
with test scores remains approximately con- 
stant. This indicates that with progress 
through the four-year course the factors
other than ability, e.g., industry, back-
ground of previous preparation, etc., become
relatively more important as determiners of
scholastic success.

There is abundant evidence that teachers'
individual marks, such as grades on a given 
paper, are highly unreliable. This does not 
indicate, however, such unreliability of 
semester grades, which are composed of a 
large number of individual measures taken 
throughout the semester. When we consider
semester averages, as well as averages for
longer periods, which are based on even larg-
er numbers of individual measures, there is
every reason to believe, as we have found
here, that they are much more reliable than 
single marks. It should be repeated that 
all our measures of reliability used here 
should be regarded as lower limits, and not 
the exact reliabilities of grades.

The reliability of individual semester 
grades was determined by the coefficient of 
contingency method,³ each grade for an indi-
vidual being plotted against each of his
other grades. With a five-fold classifica-
tion this may be regarded as practically
equivalent to the Pearson product-moment cor-
relation. This coefficient was .43, which
is almost identical to what we should have
found from the two-grade semester averages
by the use of the Spearman-Brown formula. A-
gain this coefficient should be regarded as
the lower limit of the reliability of indi-
vidual grades. Since it represents the cor-
relation between different school subjects,
variations in the factors mentioned above
(industry, preparation, etc.) would operate
freely to lower the correlation while not
at all affecting the individual teacher's opin-
ion.
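The comparison with the Spearman-Brown formula can be made explicit by stepping the two-grade reliability down to a single grade. This is a sketch only; `spearman_brown` is a hypothetical helper, and a single grade is treated as half the length of a two-grade average.

```python
def spearman_brown(r: float, n: float) -> float:
    """Reliability of a measure whose length is changed n-fold (n < 1 shortens it)."""
    return n * r / (1.0 + (n - 1.0) * r)

r_two_grades = 0.596
r_single_grade = spearman_brown(r_two_grades, 0.5)   # about .42
print(f"{r_single_grade:.3f}")
```

The stepped-down value of about .42 sits close to the .43 obtained by the contingency method, which is the agreement the paragraph refers to.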

Table II summarizes the reliabilities of 
grades found in this study, the limits of 
possible correlation with test scores, and 
in two cases the actual correlation with 
test scores.


TABLE II

THE RELIABILITY OF GRADES FOUND IN THIS INVESTIGATION

[Table body largely illegible in this copy. Legible column headings: Semester Grade; Semester Average. Row labels: Reliability; Limit of Correlation with test; Correlation with test. Legible entries: .886 and .510; .946 and .491.]

*Reliabilities determined by the Spearman-Brown
formula.





1. T. L. Kelley, Statistical Method (New York: The Macmillan Company, 1925), p. 205.
2. Ibid., p. 225.
3. H. E. Garrett, Statistics for Students of Education and Psychology (New York: Longmans, Green and Company, 1926), p. 195.










CONCLUSIONS


We may conclude that the low correlations
between intelligence test scores and school
marks are due very largely in the case of se-
mester averages, and almost entirely in the
case of two-year averages, to a fundamental 
lack of relationship rather than to the un- 
reliability of school marks. Our results 
give some indication that increasing the re- 
liability of school marks has practically no 
effect on their correlation with test scores.
While increasing the reliability would be 
expected to increase the correlation with 
test scores, in this case other factors were 
inevitably introduced which tend to lower 
the correlation, and the net result is that 
it remains approximately constant. This re- 
lation obviously could not hold for extreme- 
ly low reliabilities, but teachers' marks 
would not have to have a reliability above 
.30 to make possible a correlation with test
scores as high as is usually found.


STATISTICAL NOTE 


When the measures are expressed in terms
of standard deviations the regression of one
variable on another is represented by the
equation: $X_2 = M_2 + r_{12}(X_1 - M_1)$. If $r_{12} = .80$
and $(X_1 - M_1) = -2\sigma_1$, then $(X_2 - M_2)$
would be $-1.6\sigma_2$. The variation of the cor-
responding true scores $(X_2 - M_2)$ will bear
a certain relation to the variation of the
whole distribution of scores. This relation
is expressed by the coefficient of aliena-
tion, or $k_{12}$, which is $\sqrt{1 - r_{12}^2}$. Hence the
standard deviation of this group of scores
will be $k_{12}\sigma_2$, or $.6\sigma_2$. Assuming that
these true scores make a normal distribution
around a mean $1.6\sigma_2$ below the mean of the
whole distribution, and with a standard devi-
ation equal to $.6\sigma_2$, we can be practically
sure that no scores will be more than $3\sigma$ a-
bove their mean, or above $3 \times .6\sigma_2 = 1.8\sigma_2$.
The range of these scores is limited by $2.8\sigma_2$,
or 46.7 per cent of the range of the whole
distribution of variable (2).
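The arithmetic of this example can be traced step by step. The sketch below works in units of σ₂ and rests on two assumptions that the note implies but does not spell out: the whole distribution is taken to span three standard deviations either side of its mean, and "limited by" is read as the portion of that range lying above the highest likely score.

```python
import math

r12 = 0.80
x1_deviation = -2.0                       # score 2 sigma below the mean on variable (1)

predicted = r12 * x1_deviation            # -1.6 sigma_2
k12 = math.sqrt(1.0 - r12 ** 2)           # coefficient of alienation, 0.6
upper_bound = predicted + 3.0 * k12       # highest likely score: +0.2 sigma_2

whole_range = 6.0                         # assumed span of -3 to +3 sigma_2
excluded = 3.0 - upper_bound              # 2.8 sigma_2 of the range is cut off
share = excluded / whole_range            # about 0.467

print(f"{predicted:.1f} {k12:.1f} {upper_bound:.1f} {excluded:.1f} {share:.3f}")
```

On this reading the 2.8σ₂ and the 46.7 per cent in the note are the same quantity expressed two ways.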

The correlation between two measures is
limited by the reliabilities of the two meas-
ures. The true correlation between the func-
tions may be found by the following formula:

$$r_{\infty\omega} = \frac{r_{12}}{\sqrt{r_{11}}\,\sqrt{r_{22}}}$$

in which $r_{\infty\omega}$ is the true correlation
between the two functions, $r_{12}$ is the obtained
correlation between the two measures, and
$r_{11}$ and $r_{22}$ are the respective reliabilities
of the two measures. This relationship may
also be expressed as: $r_{12} = r_{\infty\omega}\sqrt{r_{11}}\,\sqrt{r_{22}}$.
Also the true correlation between measure (1)
and a perfectly reliable measure of function
(2) would be $r_{12}/\sqrt{r_{22}}$. If we assume a
perfect relationship between the two functions,
then the maximum obtained correlation between
the two measures becomes $\sqrt{r_{11}}\,\sqrt{r_{22}}$.
Likewise, the maximum correlation between
measure (1) and a perfectly reliable measure
of function (2) would be $\sqrt{r_{11}}$.
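The relations in this note can be written as two small functions. This is a sketch with hypothetical names; it assumes the usual correction for attenuation, in which the obtained correlation equals the true correlation multiplied by the square roots of both reliabilities.

```python
import math

def true_correlation(r12: float, r11: float, r22: float) -> float:
    """Correlation between the two functions, corrected for unreliability."""
    return r12 / (math.sqrt(r11) * math.sqrt(r22))

def max_obtained_correlation(r11: float, r22: float) -> float:
    """Largest obtainable correlation when the two functions are perfectly related."""
    return math.sqrt(r11) * math.sqrt(r22)

# A perfectly reliable predictor against semester averages of reliability .785.
limit = max_obtained_correlation(1.0, 0.785)
print(f"{limit:.3f}")

# Round trip: correcting the limiting correlation recovers a true correlation of 1.
recovered = true_correlation(max_obtained_correlation(0.95, 0.785), 0.95, 0.785)
```

The first call reproduces the limit of .886 quoted for semester averages in the body of the article.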





THE VALIDITY OF CERTAIN PROGNOSTIC TESTS IN PREDICTING 
ALGEBRAIC ABILITY 
T. L. Torgerson 
University of Wisconsin 
Geneva P. Aamodt
Muskegon, Michigan 


During the past ten years an increased 
emphasis has been placed upon the importance 
of knowing the pupil, the provision of edu- 
cation in terms of individual needs, and the 
use of more scientific techniques in pupil 
guidance. In this modern program of educa- 
tion, aptitude tests are receiving consider- 
able attention, and their inherent values for 
prediction and guidance are quite generally 
recognized. While aptitude tests are objec- 
tive, and some of them are highly reliable, 
the fundamental issue of whether or not they 
have sufficient validity to insure their use- 
fulness as predictive instruments in a guid- 
ance program is still largely open to ques- 
tion. 


This study was undertaken in an attempt to 
throw some light on the question of the va- 
lidity of two aptitude tests in algebra as 
compared with an intelligence test. The tests
chosen were the Lee Test of Algebraic Abili-
ty, the Orleans Algebra Prognosis Test and
the Otis Self Administering Test of Mental
Ability, Higher Examination. The tests were
administered during the first month of the
school year, 1931-32, to 236 ninth-grade pu-
pils in the Muskegon, Michigan, High School.
All of these pupils were enrolled in algebra
classes, and with the exception of 30 pupils
who were pursuing general mathematics, con- 
stituted the entire ninth-grade class. The 
students enrolled in general mathematics were 
not included in this study. 

The relationships among the several measures
used in the study are shown in Table I. The
intelligence quotients correlated with the
marks in algebra at the end of the first se-
mester to the extent of .61. The Orleans
Test and grades in algebra yielded a coeffi-
cient of .60 while the Lee Test and algebra
grades yielded a coefficient of .62. These
coefficients indicate that the intelligence
test and the aptitude tests were about equal-
ly effective in predicting success in alge-
bra. The intelligence quotients correlated
with the Orleans scores to the extent of .64
and with the Lee scores to the extent of .68.
The correlation between the Orleans and the 
Lee test scores was .76. These coefficients 
indicate that the intelligence test and the 
aptitude tests measured the same abilities 
to a marked degree but that the two aptitude 
tests did measure to some extent abilities 
aside from intelligence. 


TABLE I

ZERO ORDER COEFFICIENTS SHOWING THE
INTERRELATIONSHIP OF THE SEVERAL VA-
RIABLES. N = 236.

Variables correlated                                        r
Otis IQ's with Orleans Algebra Prognosis Test Scores      .6375
Otis IQ's with Lee Algebraic Ability Test Scores          .6821
Orleans Test Scores with Lee Test Scores                  .7579
Otis IQ's with First Semester Algebra Grades              .6076
Orleans Algebra Prognosis Test Scores with Algebra Grades .6036
Lee Algebraic Ability Test Scores with Algebra Grades     .6208




The value of an aptitude test in algebra 
is largely dependent upon the success with 
which the test can select pupils who will 
fail because of lack of the mathematical ap- 
titude necessary for success in that sub- 
ject. If many of the pupils who received low 
scores in the test secured passing grades in 
algebra, the test would be of little or no 
value for purposes of prediction and guid- 
ance. Some pupils who receive high scores on
the test may fail because of other factors
conditioning the learning process such as
lack of industry, poor study habits or un-
satisfactory attitude. This fact does not
lessen the validity of the test for purposes
of guidance if the test discriminates suf-
ficiently to permit the establishment of a
critical score below which pupils will not
succeed.

The first semester marks in algebra were 
tabulated in relation to the intelligence 
quotients. Pupils with intelligence quo- 
tients below 90 are generally classified as 
dull and are rarely capable of mastering the 
traditional high school curriculum. Of the
seventeen pupils with intelligence quotients 
below 90, thirteen or 76 per cent failed in 
algebra the first semester and the remaining 
four received D's, the lowest passing grade. 
The second semester two of these D pupils 
failed and one left school before the end of 
the year. It is likewise true that thirty-
five other pupils with intelligence quotients 
above 90 also failed the first semester. The 
causes of these failures, however, probably 
cannot be attributed to low intelligence. 
The percentage distribution of the marks in
algebra is shown in Table II.


TABLE II

PERCENTAGE DISTRIBUTION OF FIRST AND SECOND SEMESTER ALGEBRA GRADES
IN RELATION TO INTELLIGENCE QUOTIENTS, 236 PUPILS

[Table body illegible in this copy. Rows: IQ intervals 125-129, 120-124, 115-119, 110-114, 105-109, 100-104, 95-99, 90-94, 85-89, 80-84, 75-79, 70-74. Columns: semester grades.]

The distribution of algebra marks in re-
lation to the Lee Aptitude test scores as
shown in Table III places twenty-six pupils
below the 40th percentile score. Two of
these pupils received B's, eight received
D's and sixteen failed. Four of those re-
ceiving D's at the end of the first semester
failed at the end of the second semester,
while three others received D's and one left
before the end of the year. The 40th per-
centile was selected because it is the point
in the distribution of Lee scores below
which the minimum number of A's, B's and
C's appear. The author of the test suggests
a raw score of 60 as the critical score.
Twenty pupils received scores below 60, and
of this group fourteen or 70 per cent failed
the first semester while the remaining six
received grades of D. (See Table III.)

TABLE III

PERCENTAGE DISTRIBUTION OF FIRST AND SECOND SEMESTER ALGEBRA GRADES IN
RELATION TO THE PERCENTILE SCORES ON THE LEE TEST, 236 PUPILS

[Table body illegible in this copy. Rows: percentile scores. Columns: semester grades.]

The distribution of marks in algebra in
relation to the percentile scores on the Or-
leans test is shown in Table IV. Twenty-
four pupils received scores below the twen-
tieth percentile and of this number two re-
ceived B's in algebra at the end of the
first semester; eight received D's and four-
teen failed. Seven of the eight who re-
ceived D's failed the second semester, and
the other one left before the end of the year.
The same criterion was used in selecting the
critical point as was used in the Lee test.
The author of the Orleans test suggests that
in order to determine the per cent of pupils
that probably will fail one should take a
group which is one-half the per cent of
failures in algebra in the school over a
period of years. These pupils may be ex-
pected to have but one chance in four to
succeed in algebra. The average per cent
of failure in algebra in the Muskegon High
School was sixteen. Therefore eight per
cent or approximately eighteen pupils would
be expected to fail. Eleven or 61 per cent
of the eighteen pupils receiving the lowest
Orleans scores failed the first semester and
five or 28 per cent received D's. At the
end of the second semester all but two pu-
pils from this group failed.

TABLE IV

PERCENTAGE DISTRIBUTION OF FIRST AND SECOND SEMESTER ALGEBRA GRADES IN
RELATION TO THE PERCENTILE SCORES ON THE ORLEANS TEST, 236 PUPILS

[Table body illegible in this copy. Rows: percentile scores. Columns: semester grades.]
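The size of the critical group and the percentages quoted for the three tests follow from simple proportions. The sketch below only re-derives figures already given in the text; the variable and function names are assumptions made for illustration.

```python
pupils = 236
usual_failure_rate = 0.16                   # average per cent of failure in algebra

# The Orleans rule as described: take half the usual failure rate as the critical group.
critical_share = usual_failure_rate / 2     # 0.08
expected_group = pupils * critical_share    # 18.88, "approximately eighteen"

def per_cent(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)

first_semester = {
    "IQ below 90, failed": per_cent(13, 17),
    "Lee raw score below 60, failed": per_cent(14, 20),
    "lowest eighteen Orleans scores, failed": per_cent(11, 18),
    "lowest eighteen Orleans scores, received D": per_cent(5, 18),
}

print(round(expected_group, 2))
for label, value in first_semester.items():
    print(f"{label}: {value} per cent")
```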

To what extent did the three tests agree 
on the pupils that probably would fail in
algebra? The Orleans test and the intelli-
gence test agreed on nine cases; the Lee
test and the intelligence test agreed on
fourteen cases, and the three tests agreed
on eight cases. All of these eight cases
failed in algebra at the end of the first se-
mester.


SUMMARY 


a. All three tests were about equally valid
and effective in predicting grades in
algebra.

b. The two aptitude tests were about equally
efficient in setting up a critical score
below which the students' chances for suc-
cess were slight.

c. The sharpest discrimination was made by
the intelligence test as twenty-two of the
twenty-three pupils with intelligence quo-
tients below 90 failed in algebra at the
end of the year.













A NATURAL TEST OF ENGLISH USAGE 










Roland L. Beck 
Professor of Education 
Central State Teachers College 
Edmond, Oklahoma 


THE NEED FOR A NEW TEST 


Although numerous investigations have 
shown that such exercises as proof reading, 
error-checking, and multiple choice yield 
poor measures of a student's skill in writ- 
ing, all tests constructed for the purpose 
of diagnosis in English composition have 
used variations of these exercises. Several 
tests employ sentence completion exercises 
to measure a limited number of items; but 
proof reading, error-checking, or multiple 
choice exercises constitute the major por- 
tion of all language tests. Proof reading 
requires the recognition of errors, and mul- 
tiple choice demands the selection of the 
correct response, but neither requires com-
posing. Lyman¹ says that the value of such
language tests rests on "... one highly ques-
tionable assumption, namely, that a child's 
ability to recognize an error and to correct 
it is indicative of his own present or fu- 
ture use of the particular form." 

Free composition is not satisfactory as 
a performance for measuring specific lan- 
guage ability, because a writer will often 
consciously avoid difficult or doubtful 
forms. Furthermore, time would not permit 
the teacher or research worker to establish 
error-quotients, because at least one hun- 
dred thousand running words of each stu- 
dent's composition would be required for re- 
liable standards. Few students write one
hundred thousand running words in a year.
Error-quotients should give the percentage
of errors per opportunity and the opportunity
for a given number of words. Since students
may consciously avoid doubtful or difficult
constructions, the opportunity for errors
cannot be determined. Even if such quotients
could be established they would not hold for
each student, since each student has his own
vocabulary of errors.

Since free composition is impractical as 
a means for measuring language ability, and
proof reading and multiple choice forms of 
tests do not yield measures of language a- 
bility, some other type of language perform- 
ance is needed. The sentence completion ex- 
ercise offers a means for controlling compo- 
sitions. The words needed in completing 
sentences are supplied by the student, and 
probably their form represents his natural 
or habitual English usage. The purpose of 
this study was to construct a test of Eng- 
lish composition which uses the sentence com-
pletion exercise and to determine its relia-
bility and validity. 


THE ITEMS OF THE TEST 


First, the findings of important studies 
of students' errors in free composition were 
pooled. The basis of the resulting list of 
items was a classification of errors given 
by Lyman², which was a rearrangement of the
errors found in high-school composition in
the studies of Johnson³, Lyman⁴, Stormzand⁵,
and Armstrong⁶. The list was supplemented
and enlarged, especially in the divisions 





1. R. L. Lyman, Summary of Investigations Relating to Grammar, Language, and Composition (Chicago: The University of Chicago, 1929), p. [illegible].
2. Ibid., pp. 87-89.
3. R. G. Johnson, "Persistency of Errors in English Composition," School Review, XXV (October, 1917), pp. 555-580.
4. R. L. Lyman, "Fluency, Accuracy, and General Excellence in English Composition," School Review, XXVI (February, 1918), pp. 85-100.
5. Martin J. Stormzand and M. V. O'Shea, How Much English Grammar? (Baltimore: Warwick & York, Inc., 1924), pp. 224.
6. Wallace B. Armstrong, A Study of the [illegible] of Technical Errors in Written Composition. Master's thesis [remainder illegible].

relating to capitalization and punctuation, 
by a study of the Mechanics of Writing by
Pence¹. Such rules as occurred frequently
in use were sampled. Certain rules were e-
liminated, because they were too specific in
their scope or dealt with the artistic use
of punctuation. The rules added are found
in most handbooks of composition. Studies by
Harap², Lyman³, Pressey⁴, and Rinsland⁵ were
used where the original classification was
stated in general terms.

Some items which were included in the 
writer's list by utilizing the above research 
studies were subsequently eliminated, because 
75 per cent of the linguists in a study re- 
ported by S. A. Leonard⁶ did not consider a
distinction necessary. Later, upon the rec-
ommendation of Pence⁷, the divisions were
grouped under mechanics, grammar, and rheto-
ric, but no new divisions were added.


THE CONSTRUCTION OF THE TEST 


Two forms of a test based on this list of 
rules were constructed. Variations of the 
sentence completion exercise were employed 
in both Form A and B. Form A was construct- 
ed first. Form B is designed to measure the 
same language items as Form A, but the sen- 
tences are not the same. The test is divid- 
ed into three parts, namely, mechanics, gram- 
mar, and rhetoric. Part I of the test, me- 
chanics, measures rules of capitalization, 
apostrophe, and punctuation; Part II, gram-
mar, measures the use of adjectives and ad-
verbs, verbs, pronouns, and prepositions and 
conjunctions; and Part III, rhetoric, meas- 
ures meaning, paragraphing, and omissions 
and repetitions. An objective key was used 
in scoring the test papers, and each part of 
the test was scored on a point basis. 

The directions of the test cause the stu- 
dent to focus his attention on the meaning
of the sentences and not primarily on the
test items to be measured. The parts and 
divisions of the test do not indicate what 
is measured. All terms suggestive of gram-
mar, mechanics, or rhetoric are reserved for
the key. Thus the test measures the natural 
or habitual reactions in composition without 
any emphasis on the grammar or mechanics of 
English. A mimeographed edition of the test 
was administered to determine the time to be 
allowed and to sample the type of response 
given to the completion exercises. Certain 
sentences were eliminated and others were im- 
proved on the basis of the results secured 
from this preliminary testing. 


THE DATA COLLECTED 


Two tests (which shall hereafter be refer- 
red to as the first and second tests) were 
given to an experimental group of college 
students. The resulting scores were used in 
establishing the reliability of the test and 
the coefficient of validity between the test 
and free composition. From 1,000 to 2,000 
running words of each student's composition 
were collected as a criterion of free compo- 
sition. The average number of words was 
1,611. 

All errors per 100 running words in the 
compositions of the 99 students of the exper- 
imental group were counted and charted in 11
categories--namely, capitalization, apostro-
phe, punctuation, adjectives and adverbs,
verbs, pronouns, prepositions and conjunc-
tions, meaning, paragraphs, sentence struc-
ture, and omissions and repetitions. The e-
leven divisions of the test correspond to 
these eleven error categories. All errors 
were counted and not just errors correspond- 
ing to test items--i.e., all verb errors for 
all verbs were counted. All errors of the 
first count were weighted on the basis of the 





1. R. W. Pence, A Manual of the Mechanics of Writing (New York: The Macmillan Company, 1929).
2. Henry Harap, "The Most Common Grammatical Errors," English Journal, XIX (June, 1930), pp. 440-446.
3. R. L. Lyman, Summary of Investigations Relating to Grammar, Language, and Composition (Chicago: The University of Chicago Press, 1929), p. 92.
4. S. L. Pressey, "A Statistical Study of Children's Errors in Sentence-Structure," English Journal, XIV (September, 1925), pp. 529-535.
5. Henry D. Rinsland, Standardized Tests and Practice Exercises in High School English (address before the American Educational Research Association, Boston, February, 1928). Bureau of Educational Research, University of Oklahoma, Norman, Oklahoma, 1928, p. 5.
6. S. A. Leonard and H. Y. Moffett, "Current Definition of Levels in English Usage," English Journal, XVI (May, 1927), pp. 345-359.
7. Recommendations received in personal letters sent to the investigator by Professor R. W. Pence, DePauw University, Greencastle, Indiana, 1931.



variability of the errors of the error cate- 
gories and the scores of the divisions of 
the test. 

In order to determine the reliability of 
the error count in the running words of com-
position a second count was made for 50 cas-
es. The errors for the first half of the
second count were listed separately from the
errors of the second half of the second
count in order to determine the reliability
of students' errors.

Proof of linear relationship between the
weighted criterion and test was established
by Blakeman's¹ test for linear relationships.

Three groups of college freshmen and 
three groups of high school seniors also 
took the test. The test scores for the 
three groups of college freshmen were corre- 
lated with their first semester grades in 
English. 

The Ohio Psychological Examination was 
given to one of the college freshmen groups, 
and coefficients of correlation were calcu-
lated between each form of the test and in-
telligence test scores. The grades for this
group were correlated with intelligence test
scores, and multiple coefficients of correla-
tion were calculated between the grades of
this group and a combination of each form of
the test with intelligence test scores.


RELIABILITY 


In Table I, the coefficients of reliabil- 
ity, standard deviations, means, and the num-
ber of cases are given for the experimental
group on the first and second tests. The test 
as a whole has a reliability of .935 with a 
probable error of .0085. The index of reli- 
ability for the test (.936) indicates the 
highest reliability which may be obtained for 
the test under similar conditions. 
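The probable error quoted for the total test can be checked against the reliability coefficient. The sketch below assumes the conventional formula P.E. = .6745(1 − r²)/√N, which the article does not state, and takes N as the 99 students of the experimental group; the function name is hypothetical.

```python
import math

def probable_error_of_r(r: float, n: int) -> float:
    """Conventional probable error of a correlation coefficient (assumed formula)."""
    return 0.6745 * (1.0 - r * r) / math.sqrt(n)

pe_total = probable_error_of_r(0.935, 99)
print(f"{pe_total:.4f}")
```

With these inputs the result agrees with the reported probable error of .0085 to four decimal places.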

In order to decide whether the test should 
be a timed test or whether students should te 
allowed to finish all of the exercises, co- 
efficients of reliability were calculated for 


TABLE I

COEFFICIENTS OF RELIABILITY, STANDARD DEVIATIONS, MEANS,
AND NUMBER OF CASES FOR THE EXPERIMENTAL GROUP

[The body of this table is illegible in the scanned copy. Legible column fragments, in scan order with row labels lost:
P.E. of r: .0397, .0247, .0312, .0177; .0524, .0517, .0461, .0163, .0136, .0478, .0405, .0114; .0596, .0512, .0397, .0584, .0486, .0289, .0213; .0085
S.D.: 3.07, 2.85, 4.28, 8.09; .98, 1.05, 1.02, 5.66, 5.21, 1.52, 1.60, 8.73; 1.51, 1.49, 1.40, 3.79, 1.80, 3.29, 6.14; 19.71]

*Sections a and b (Unity and Coherence) are combined to measure paragraph items. 





1. J. Blakeman, "On Tests for Linearity," Biometrika, IV (1906), pp. 552-560. 







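TABLE II

COEFFICIENTS OF RELIABILITY, STANDARD DEVIATIONS, MEANS, AND
NUMBER OF CASES FOR ALL OF THE EXPERIMENTAL GROUP WHO
COMPLETED THE FIRST AND SECOND TESTS

[The body of this table is illegible in the scanned copy. Legible column heads: 1st S.D., P.E. of r, 1st Mean, 2nd S.D. Legible entries, in scan order with row labels lost: .0163 and 7.57; .0145 and 7.62; .0261 and 6.60; 7.91 and 54.2; 6.26.]
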

all of the experimental group who completed
each part or the total test on the first and
second tests. These coefficients of relia-
bility are given in Table II.

The differences between the reliabilities
of the parts of the test in Table I and Ta-
ble II are not statistically significant, but
the difference between the reliabilities of
the total test is significant. The differ-
ence between the reliabilities of the total
test divided by the probable error of the
difference is 3.43, which indicates that the
chances for a true difference greater than
zero are 99 in 100. This indicates that stu-
dents should be allowed to finish the test.
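The step from a ratio of 3.43 to "99 in 100" can be illustrated as follows. This is a sketch under two assumptions that the text does not state: that the difference is normally distributed, and that one probable error equals 0.6745 standard deviations. The function name is illustrative.

```python
import math

PE_PER_SIGMA = 0.6745  # one probable error expressed in standard deviations

def chance_true_difference_positive(ratio_in_pe: float) -> float:
    """Normal-curve chance that the true difference exceeds zero,
    given the observed difference divided by its probable error."""
    z = ratio_in_pe * PE_PER_SIGMA            # convert P.E. units to sigma units
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

chance = chance_true_difference_positive(3.43)
print(f"{chance:.3f}")
```

On these assumptions the chance comes out at roughly 99 in 100, in agreement with the figure given above.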

The reliability of the count of errors in
composition as calculated by Willing¹ was be-
tween the count of one person and the count
of a second person. In this study the count
was made twice by the same person, since two 
different persons do not ordinarily score the 
same composition. The reliability of the er- 
ror count (.981) indicates that a teacher can 
mark compositions for errors with a high de- 
gree of consistency. 

The reliability of students' errors is 
found by correlating the errors in the first 


TABLE III

COEFFICIENTS OF RELIABILITY, STANDARD DEVIATIONS, MEANS,
AND NUMBER OF CASES FOR STUDENT ERRORS

[Only the two columns of coefficients can be read with any confidence in the scanned copy; the row labels, standard deviations, means, and numbers of cases are illegible. Values are given as scanned, in row order.
r*: .668, .410, .447, .588; .269, .426, .415, .219, .593; .505, .197, .439, .227, .504; Total Test .651
r**: .800, .581, .617, .740; .423, .597, .586, .483, .744; .671, .329, .610, .492, .670; Total Test .787 (P.E. .0362)]

*Correlation between the errors in the first half and the errors in the second half of students' compositions.

**Self-correlation between the errors in the first half and the errors in the second half of students' compositions stepped up by Spearman's "prophecy" formula.


1. Matthew H. Willing, Valid Diagnosis in High School Composition, Teachers College Contributions to Education, No. 250 (New York: Teachers College, Columbia University, 1926), p. 64.










half with the errors in the last half of stu-
dents' compositions. The self-correlations
and the self-correlations stepped up by
Spearman's "prophecy" formula are given in
Table III.
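The stepping-up referred to here is commonly written r' = 2r / (1 + r) for a test doubled in length. A minimal sketch follows; the function name and the general `factor` argument are illustrative and are not taken from the article.

```python
def spearman_brown(r_half: float, factor: float = 2.0) -> float:
    """Spearman's prophecy formula: reliability of a test lengthened
    `factor` times, given the correlation between comparable parts."""
    return factor * r_half / (1.0 + (factor - 1.0) * r_half)

# Example: a half-against-half correlation of .588 steps up to about .74.
stepped_up = spearman_brown(0.588)
print(round(stepped_up, 3))
```

A half-against-half correlation of .588, for instance, steps up to about .74, which is the size of coefficient reported for compositions of full length.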


VALIDITY 


Three coefficients of validity were cal-
culated between the test and the criterion.
The first one (.714) was between criterion
errors per 100 running words without weights
and the first test. The second one (.728)
was between weighted criterion errors per
100 running words and the first test. The
third one (.819) was between weighted cri-
terion errors and the first test, supplement-
ed with all scores on the second test that
were 10 or more points higher than those on
the first test. The last validity coeffi-
cient was determined to show the effect of
the failure to follow directions on the
first test.

The coefficients of validity for the test
with the criterion with weights (weighted
errors per 100 running words of students'
compositions) are given in Table IV. The
correlations corrected for attenuation are
based on the reliabilities of students' er-
rors and the reliabilities of the parts of
the test and the total test. The validity
of the test secured by correlating test
scores and student errors (.728) is almost
as high as the reliability of students' er-
rors (.787).
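The correction for attenuation is usually written r_c = r_xy / √(r_xx · r_yy). The sketch below applies it to the three total-test figures quoted in the text (.728, .935, and .787). The function name is illustrative, and the result is offered only as a check on the arithmetic, not as a value read from Table IV.

```python
import math

def correct_for_attenuation(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Validity coefficient corrected for unreliability in both measures."""
    return r_xy / math.sqrt(r_xx * r_yy)

# r_xy: test against weighted criterion; r_xx: reliability of the test;
# r_yy: stepped-up reliability of students' errors (all quoted in the text).
corrected = correct_for_attenuation(0.728, 0.935, 0.787)
print(round(corrected, 3))
```

With these figures the corrected coefficient is somewhat above .84, which shows how much of the shortfall in validity may be laid to the unreliability of the criterion.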


TABLE IV

COEFFICIENTS OF VALIDITY, STANDARD DEVIATIONS, MEANS, COEFFICIENTS
OF VALIDITY CORRECTED FOR ATTENUATION, AND NUMBER OF CASES FOR THE
FIRST TEST WITH WEIGHTED CRITERION

[The layout of this table is lost in the scanned copy. Legible entries, in scan order, with the last value in each column belonging to the total test:
Error S.D.: .65, 1.01, .71; 1.83
Test S.D.: 8.15, 7.44, 6.84; 22.86
Error Mean: 1.65, 1.22, 1.59; 4.33
Test Mean: 53.5, 59.1, 45.0; 155.3
N: 99, 96, 88; 99
P.E. of r: .0508, .0364, .0627, .0414
r*: -.728, -.624
The corrected coefficients are illegible.]

*The validity coefficients are negative because students who made the fewest errors in their compositions made the highest scores on the test. For this reason they indicate a positive relationship and are referred to as positive.


TABLE V

COEFFICIENTS OF VALIDITY, STANDARD DEVIATIONS, MEANS, AND NUMBER
OF CASES FOR TEST WITH FIRST SEMESTER COLLEGE GRADES

                                        P.E.     Test     Grade     Test     Grade
College                          r      of r     S.D.     S.D.      Mean     Mean
DePauw University              .555    .0880     5.33     1.05    177.67     2.53
Union University               .629    .0637    19.70     2.00    152.60     6.75
Union University               .616**  .0652    19.70    11.06    152.60    84.58
Oklahoma Baptist University    .639    .0399    16.14     2.51    168.33     6.70
Oklahoma Baptist University    .668    .0373    16.13     2.51    157.37     6.70

[The column giving the number of cases is illegible in the scanned copy.]

*Last Division of Part III of each form omitted except for Oklahoma Baptist University, Form

**First Semester Final Examination Grades.







Coefficients of validity secured by cor-
relating the test and first semester college
freshmen grades in English are given in Ta-
ble V. The coefficients range from .555 to
.668.

A further study of validity was made with
the Oklahoma Baptist University freshmen
group. Correlations between the test, se-
mester marks in English, and scores made on
the Ohio Psychological Examination are given
in Table VI. The multiple coefficients of
correlation between grades in freshmen Eng-
lish and the combination of the test and in-
telligence test scores (.735 for Form A and
.758 for Form B) were also computed. These
two tests combined into a team will predict
first semester English grades as well as
grades in English will predict other English
grades of the same students.
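A multiple coefficient of this kind can be obtained from the three simple correlations by the usual formula R² = (r12² + r13² - 2·r12·r13·r23) / (1 - r23²), where 1 is the criterion (grades) and 2 and 3 are the two predictors. The sketch below is illustrative only: r13 = .639 is the Form A validity quoted for this group, while r12 = .60 and r23 = .50 are hypothetical values chosen for the example, since the corresponding entries cannot be read here.

```python
import math

def multiple_r(r12: float, r13: float, r23: float) -> float:
    """Multiple correlation of variable 1 with variables 2 and 3 combined."""
    r_squared = (r12**2 + r13**2 - 2.0 * r12 * r13 * r23) / (1.0 - r23**2)
    return math.sqrt(r_squared)

R13_FORM_A = 0.639      # test (Form A) with grades, quoted in the text
R12_HYPOTHETICAL = 0.60  # grades with intelligence: HYPOTHETICAL value
R23_HYPOTHETICAL = 0.50  # intelligence with test: HYPOTHETICAL value

combined = multiple_r(R12_HYPOTHETICAL, R13_FORM_A, R23_HYPOTHETICAL)
print(round(combined, 3))
```

The point of the example is only that the team of two measures predicts better than either one alone, by an amount that shrinks as the two predictors correlate more highly with each other.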


TABLE VI

CORRELATIONS BETWEEN (1) FIRST SEMESTER GRADES,
(2) SCORES ON OHIO PSYCHOLOGICAL EXAMINATION,
(3) SCORES ON FORM A OF TEST, AND (4) SCORES ON
FORM B OF TEST FOR THE OKLAHOMA BAPTIST UNIVER-
SITY FRESHMEN

Variables    P.E. of r    S.D.1     S.D.2
   12          .0309      2.[?]9    38.87
   13          .0399      2.51      16.14
   14          .0373      2.51      16.13
   23          .0349     37.55      16.25
   24          .039[?]   38.70      15.60

[The column of correlation coefficients is illegible in the scanned copy; [?] marks an illegible digit.]




















SUMMARY 


The grade validities are as high as sin-
gle tests can be expected to correlate with
college grades in English, especially when
all of the field of English was not tested.
Knowledge of grammar, literature, and spell-
ing were the fields omitted. The test pre-
dicts grades in English about as well as
English grades will predict future grades in
English.

The correlation between the test and in-
telligence shows a definite relation despite
the fact that in constructing the test spe-
cial precaution was taken to eliminate the
factor of intelligence. The completion form
may be responsible for this in part. Cer-
tainly the fact of relationship is desirable
and harmonizes with all psychological reason-
ing in that skill in English usage requires
intelligence.

The coefficient of validity (.728) secured
by correlating weighted errors per one hun-
dred running words and test scores was not
statistically different from the coefficient 
of validity (.714) secured by correlating 
the errors per one hundred running words and 
test scores. This seems to indicate that all 
errors should be counted, especially since 
error-quotients are not practical, and the 
simple unweighted errors can safely be used 
without the added labor of weighting. 

In spite of the limitations of the crite- 
rion (weighted errors per 100 running words 
of students' compositions) the coefficient 
of validity between the test and the criteri-
on shows a definite relation. In a natural
test of English composition a higher criteri-
on validity might be expected than was se-
cured. The lower degree of validity is prob-
ably due in part to three factors. In the
first place, all items of the test were not 
sampled in the criterion. In the second 
place, some errors occurred so frequently in 
certain cases that the validity coefficient 
was seriously affected. In the third place, 
certain qualities in the test could not be 
measured in the criterion, except in a gen- 
eral way, and for that reason could not be 
scored on a point basis. Composition scales 
measure these qualities by comparisons with 
specimens of certain quality. 

These considerations indicate that the 
true validity is really higher than the cri- 
terion validity coefficient obtained. A 
strong argument for the validity of the test 
is the fact that the sentence completion form 
does require the student to focus his atten- 
tion on the meaning and the completion of the 
sentences and thus measures his natural or 
habitual composition reactions, and that the 
test samples many more rules than even a 
large number of free compositions would con- 
tain. Apparently the test is a measure of 
language habits and practice rather than 
textbook knowledge of language. The comple-
tion exercises may be correctly responded to
on a basis of the students' natural reac-
tions without knowledge of specific rules.
It is thus in form a natural test of English
usage. This condition makes the test more
acceptable than other types of tests now
available.

The statistical results coupled with the
psychological basis of the test indicate
that future tests of specific language abil-
ity should use sentence completion exercises
or variations of this form to measure skill
in composition. The test itself can be im-
proved by a complete revision, and the parts
of the test can be made more diagnostic by
increasing the number of times each rule is
sampled. Completion sentences based on this
study can also be used in constructing reme-
dial practice exercises which with the test
and pooled list of rules provide a program
for diagnostic testing and remedial teaching
of the fundamentals of English composition.







PRELIMINARY REPORT ON THE STANFORD BINET IQ CHANGES OF 
SUPERIOR CHILDREN 
E. A. Lincoln 
Harvard University 


The writer became interested in the prob-
lem of IQ changes among superior children be-
cause of an apparent contradiction between
the results obtained by Terman in his study
of gifted children in California,¹ and those
obtained by Cattell in her analysis of the
Harvard Growth Study data.² In the former
study it was discovered that the IQ's of
gifted children (those having an initial IQ
of 140 or more) dropped substantially over
a period of six years and that the girls lost
more than the boys. The Harvard Growth Study
subjects were cases in which the average of
two or more IQ's was 120 or over, and these
pupils showed a decided tendency to gain. It
is quite possible, of course, that the dis-
agreement in these two sets of findings is
the direct result of the different methods
of selecting the cases. It is clear that the
regression effect will be greater in a group
where the initial IQ's are over 140 than in
a group which includes lower IQ's. Further-
more, the method of averaging used by Cat-
tell gives a more reliably determined group,
and eliminates some of the cases with sub-
stantial losses.³ Notwithstanding these
plausible explanations of the different find-
ings, it was considered desirable to follow
this matter further, and with this purpose
in mind the present investigation was under-
taken by the writer and Dr. Cattell.
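The effect of selecting on an average of IQ's, as described in the third footnote, may be sketched as follows. The three cases are those of the footnote (initial IQ's of 122, with later changes of +20, -2, and -20); the cutoff of 120 is the one used in the Harvard Growth Study group, and the variable names are illustrative.

```python
INITIAL_IQ = 122
CUTOFF = 120
later_changes = [20, -2, -20]  # the three cases of the footnote

# Average of the initial and the later IQ for each child.
averages = [(INITIAL_IQ + (INITIAL_IQ + change)) / 2 for change in later_changes]

# Only cases whose average reaches the cutoff stay in the selected group.
retained_changes = [
    change for change, avg in zip(later_changes, averages) if avg >= CUTOFF
]

print(averages)          # averages of the two tests for each case
print(retained_changes)  # changes of the cases that remain in the group
```

All three children would enter a group selected on the single initial IQ, but the child with the 20-point loss drops out when selection rests on the average, so a group chosen by averages necessarily shows fewer large losses.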

The children whose records are presented
in this report have all been tested both
times by the writer. This gives an element
of constancy which has not appeared in other
investigations. The cases come from three
towns which make a practice of admitting to
the kindergarten and first grade children
who are under the chronological age limit, if
they obtain a Stanford Binet mental age which
equals or surpasses the required chronologi-
cal age. Up to the present time, 92 chil-
dren with initial IQ's of 119 or over have
been re-examined at intervals ranging from
five to eight years.⁴

Distributions of the IQ's of all the chil-
dren who were examined and from which the 92
cases for the present study were selected are
presented in Table I. The group as a whole
was somewhat above average, as the median


TABLE I

IQ DISTRIBUTION OF UNDER-AGED FIRST GRADE
AND KINDERGARTEN CANDIDATES

                 BOYS            GIRLS
   IQ          N      %        N      %
 145-9         1     .4        1     .3
 140-4         4    1.4        5    1.6
 135-9         2     .7        7    2.2
 130-4        12    4.3        7    2.2
 125-9        10    3.5       19    6.0
 120-4        23    8.2       34   10.8
 115-9        25    8.9       43   13.6
 110-4        42   14.9       45   14.3
 105-9        52   18.4       55   17.5
 100-4        35   12.4       35   11.1
  95-9        16    5.7       14    4.5
  90-4        16    5.7       20    6.4
  85-9        17    6.0       11    3.5
  80-4        14    5.0        9    2.9
  75-9         4    1.4        6    1.9
  70-4         5    1.8        3     .9
  65-9         3    1.1        1     .3
  60-4         1     .4
 Total       282             315

 Median IQ   107.9           110.3













1. L. M. Terman, Genetic Studies of Genius, III (Stanford, California: Stanford University Press, 1930), p. 25.

2. P. Cattell, "Constant Changes in the Stanford Binet IQ," Journal of Educational Psychology, XXII (October, 1931), p. 544.

3. An example will make this clear. Suppose three children have initial IQ's of 122. On later tests one gains 20 points, one loses 2 points, and the third loses 20 points. The averages will be 132, 121, and 112, and the third case will be eliminated from the 120 group.

4. E. A. Lincoln, "The Later Performance of Under-aged Children Admitted to School on the Basis of Mental Age," Journal of Educational Research, XIX (January, 1929), pp. 22-30.







IQ's show, and the girls were slightly su-
perior to the boys. There were 18.5 per
cent of the boys and 23.1 per cent of the 
girls with initial IQ's of 120 or above. 
About one-third of these superior pupils had 
moved away from the towns in which they had 
been admitted to school, and so could not be 
re-examined. It does not appear probable 
that any special selective factor operated 
in determining the group which remained in 
the school systems. 
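The proportions with initial IQ's of 120 or above can be recomputed from the frequencies of Table I. In this sketch the class frequencies are those read from the table, and the totals of 282 boys and 315 girls are obtained by summing the columns; the variable names are illustrative.

```python
# (IQ class, boys, girls) for the classes at or above IQ 120, read from Table I.
classes_120_up = [
    ("145-9", 1, 1),
    ("140-4", 4, 5),
    ("135-9", 2, 7),
    ("130-4", 12, 7),
    ("125-9", 10, 19),
    ("120-4", 23, 34),
]
BOYS_TOTAL = 282   # sum of the boys' frequency column
GIRLS_TOTAL = 315  # sum of the girls' frequency column

boys_120_up = sum(boys for _, boys, _ in classes_120_up)
girls_120_up = sum(girls for _, _, girls in classes_120_up)

boys_share = 100.0 * boys_120_up / BOYS_TOTAL
girls_share = 100.0 * girls_120_up / GIRLS_TOTAL
print(boys_120_up, round(boys_share, 1))
print(girls_120_up, round(girls_share, 1))
```

Computed directly from the frequencies the shares differ by a tenth of a point from the 18.5 and 23.1 per cent given in the text, which appear to be the sums of the rounded class percentages.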

A summary of the IQ changes which were
found appears in Table II. It is interesting
to note that the range of changes in the
group as a whole is very great, since it runs
from a thirty-six point loss to a thirty-
five point gain. Slightly more than one-
third of the group changed less than five
points, and in three cases there was no
change at all. The losses outnumber the
gains in the ratio of 3 to 2. Also, the
losses are, on the average, nearly twice as
great as the gains in magnitude. If the
changes are distributed with regard to sign
or direction, the median is -3.67. All these
data indicate that the superior pupils lose
rather than gain.

The results for the two sexes shown in
Table II indicate that the girls have about
10 per cent fewer gains than the boys and 10
per cent more losses. In size, the boys'
median gain is nearly twice as large as the
girls' median gain, while the girls' median
loss is almost twice as large as the boys'
median loss. These data taken together
show what appears to be a real sex differ-
ence.

In order to get a selection of cases which
more closely resembled Terman's very superi-
or group, all the cases with initial IQ's
of 130 or more were distributed separately.
The results of this group are shown in Table
III on the following page. They are not
materially different from the results of the
total group. The chief difference is that
the ratio of losses to gains is slightly
greater, and these ratios are more nearly
alike for the sexes than in the total group.
In the selected group, as in the total group,
the average loss of the boys is substantial-
ly smaller, and the average gain of the boys
is substantially greater than the correspond-
ing loss and gain for the girls.

It thus appears that gifted children se-
lected on the basis of a single initial Stan-
ford-Binet IQ are likely to lose rather than
gain, on the average, and that the girls are 
likely to lose more than the boys. There is 
no way of telling from the data at hand what 
would happen in a group selected on the basis 
of the average of two or more IQ's, since 
only one IQ was available at the beginning of 
the present study. 

An interesting analysis has been made of 
the failures on the separate tests in the 
age groups from ten to eighteen inclusive, 








TABLE II

IQ CHANGES IN SUPERIOR CHILDREN

                                     CHANGE IN IQ
                 ±0   1-4   5-9  10-14  15-19  20-24  25-29  30-34  35-39     N      %     Md.
Boys' Gains             5     4     2      2      2      0      0      1     16   42.2    8.75
Boys' Losses            8     6     2      4      1                          21   55.2    7.08
Boys' Total       1    13    10     4      6      3      0      0      1     38  100      7.50
Girls' Gains            9     4     2      2                                 17   31.5    4.78
Girls' Losses           7     4    13      5      1      2      2      1     35   64.8   12.50
Girls' Total      2    16     8    15      7      1      2      2      1     54  100      8.12
Total Gains            14     8     4      4      2      0      0      1     33   35.9    6.56
Total Losses           15    10    15      9      2      2      2      1     56   60.8   11.00
Total Change      3    29    18    19     13      4      2      2      2     92  100      8.89
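Most of the medians in the last column of Table II can be reproduced by the ordinary interpolation for grouped data, provided the classes are taken to run from 1 to 5, 5 to 10, 10 to 15, and so on. That reading of the class limits is an inference from the printed figures, not something stated in the article, and the function below is only a sketch.

```python
def grouped_median(classes):
    """Median by linear interpolation within the class containing the middle case.
    `classes` is a list of (lower_limit, upper_limit, frequency) in ascending order."""
    total = sum(freq for _, _, freq in classes)
    half = total / 2.0
    cumulative = 0
    for lower, upper, freq in classes:
        if freq and cumulative + freq >= half:
            return lower + (half - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("no cases")

# ASSUMED class limits; frequencies are the Total Gains and Total Losses rows.
limits = [(1, 5), (5, 10), (10, 15), (15, 20), (20, 25), (25, 30), (30, 35), (35, 40)]
total_gains = [(lo, hi, f) for (lo, hi), f in zip(limits, [14, 8, 4, 4, 2, 0, 0, 1])]
total_losses = [(lo, hi, f) for (lo, hi), f in zip(limits, [15, 10, 15, 9, 2, 2, 2, 1])]

print(round(grouped_median(total_gains), 2))
print(round(grouped_median(total_losses), 2))
```

On this reading the medians for total gains and total losses come to about 6.56 and 11.00, matching the table.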





























































TABLE III

IQ CHANGES OF CHILDREN WITH INITIAL IQ'S OF 130 OR ABOVE

                                      IQ CHANGE
                 ±0   1-4   5-9  10-14  15-19  20-24  25-29  30-34  35-39     N      %     Md.
Boys' Gains             2           2                                         4   33.3    7.50
Boys' Losses            3     4            1                                  8   66.7    6.25
Boys' Total             5     4     2      1                                 12           6.25
Girls' Gains            3     1            1                                  5   38.5    4.33
Girls' Losses           2     2     1             2      1                    8   61.5   10.00
Girls' Total            5     3     1      1      2      1                   13           7.50
Total Gains             5     1     2      1                                  9   36.0    4.60
Total Losses            5     6     1      1      2      1                   16   64.0    7.50
Total Change           10     7     3      2      2      1                   25           6.79














on the Stanford scale. For this purpose 
each sex was divided into three groups: 

(1) Those who lost 5 or more points in IQ

(2) Those who gained 5 or more points in IQ

(3) Those whose IQ's changed less than 5 points.

The percentages of failure are shown in 
Table IV on the following page. While it is 
clear that the number of cases is as yet too 
small to permit of definite conclusions, it 
appears that there are some significant dif- 
ferences in the performances on the various 
items by those who gain, lose, or remain rel- 
atively constant. In Table V, page 291, are 
listed those tests in which the difference 
in the percentages of failure amounts to 19 
or more. These figures indicate several 
things. First, it is apparent that the 
Vocabulary and other linguistic tests, like 
Abstract Words and Mixed Sentences, are 
very important in differentiating among the 
various groups. The Digits Backward Tests
also stand out markedly in this respect,
and the Induction and Fable Tests appear 
frequently in the table. Eleven of the 
tests show differences in the percentages of 
failure which are over thirty-five per 
cent. These differences are most certainly 
significant. 





From the percentages given in Table IV, 
it is possible to examine sex differences in 
the various tests. Those items in which the 
difference in failures was greater than 
nineteen per cent are listed in Table VI, 
page 292, together with an indication of the 
sex making the better record. An examina- 
tion of this table shows that thirteen of the 
comparisons are in favor of the girls, while 
nine are in favor of the boys. In the group 
of those pupils whose IQ's increased, all 
the comparisons are in favor of the girls. 
On the other hand, among those pupils whose
IQ's dropped, five out of six of the com-
parisons are in favor of the boys. It is
interesting to note that there is no consis-
tent difference in linguistic abilities as
evidenced by the Vocabulary and Abstract
Words Tests. The boys seem to have real
superiority in the Induction Test, in which
they excel in two of the groups. The girls
are superior in two groups on the Clock prob-
lems.

This study is being continued with the
yearly addition of new cases, and analyses
are being made of the effects of the age at
school entrance, the time interval between
tests, the IQ level within the group, and
other pertinent factors. Furthermore, a
group of similar children has been select-
ed for repeated examinations at two year
intervals, in order to discover, if possible,
whether the change is gradual or abrupt, and
whether it occurs during any particular
interval. Later and more complete reports
will be made as the study progresses.


TABLE IV

PERCENTAGES OF FAILURE OF VARIOUS GROUPS OF SUPERIOR CHILDREN ON THE SEPARATE
TESTS OF THE STANFORD-BINET SCALE

Column heads: Test; Boys losing 5+ points; Boys gaining 5+ points; Boys constant; Girls losing 5+ points; Girls gaining 5+ points.

[The percentages in this table are illegible in the scanned copy. The tests listed, by year level in scan order, are:
Year 10: Vocabulary, Absurdities, Designs, Reading, Comprehension, 60 Words
Year 12: Vocabulary, Abstract Words, Ball & Field, Mixed Sent., Fables, 5 Digits Backward, Pictures, Differences
Year 14: Vocabulary, Induction, Pres.-King, Prob. Fact, Arithmetic, Clocks
Year 16: Vocabulary, Fables, Abstract Diff., Boxes, 6 Digits Backward, Code
Year 18: Vocabulary, Paper Cut., 8 Digits, Memory, 7 Digits Backward, Ingenuity]



TABLE V

TESTS IN WHICH THE DIFFERENCE IN FAILURES IS GREATER THAN
NINETEEN PER CENT

Pupils Who Lose Compared With Those Who Remain Constant

            Boys                                 Girls
10-3  Designs             23.1       10-1  Vocabulary          19.4
10-4  Reading             23.0       10-3  Designs             19.4
12-1  Vocabulary          47.2       12-4  Mixed Sentences     19.4
12-2  Abstract Words      34.1       12-8  Similarities        26.3
12-3  Ball and Field      24.1       14-2  Induction           19.5
12-4  Mixed Sentences     54.4       14-4  Problems of Fact    35.6
12-5  Fables              46.1
12-6  5 Digits Backward   25.0
12-8  Similarities        46.1
14-5  Arithmetic          19.9
16-2  Fables              27.5
16-5  6 Digits Backward   21.3
18-4  Memory              21.3

Pupils Who Gain Compared With Those Who Remain Constant

            Boys                                 Girls
12-6  5 Digits Backward   28.6       12-1  Vocabulary          55.5
12-7  Pictures            24.7       12-2  Abstract Words      82.5
14-1  Vocabulary          29.5       12-3  Ball and Field      22.2
14-2  Induction           35.7       12-6  5 Digits Backward   27.4
14-4  Problems of Fact    27.9       12-7  Pictures            38.9
14-6  Clocks              26.0       14-1  Vocabulary          24.7
16-2  Fables              29.9       14-2  Induction           43.0
16-4  Boxes*              24.5       14-5  Arithmetic          28.0
16-6  Code                22.5       14-6  Clocks              32.0
18-4  Memory*             21.3       16-2  Fables              24.7
                                     16-5  6 Digits Backward   39.0
                                     16-6  Code                65.5
                                     18-2  Paper Cutting       58.5
                                     18-4  Memory              33.5
                                     18-5  7 Digits Backward   19.5

*In these tests the pupils who remained constant did better than those who gained.







TABLE VI

TESTS IN WHICH THE DIFFERENCE IN FAILURES BETWEEN BOYS AND GIRLS IS
MORE THAN NINETEEN PER CENT

a. Those who lost five or more points:

   12-4 Mixed Sentences
   14-1 Vocabulary
   14-2 Induction
   14-3 President-King
   14-4 Problems of Fact
   16-4 Boxes

b. Those who gained five or more points:

   12-1 Vocabulary
   12-2 Abstract Words
   14-1 Vocabulary
   14-5 Arithmetic
   14-6 Clocks
   16-4 Boxes
   16-5 Digits Backward
   16-6 Code
   18-2 Paper Cutting
   18-4 Memory

c. Those who remained constant:

   12-1 Vocabulary
   12-2 Abstract Words
   14-1 Vocabulary
   14-2 Induction
   14-5 Arithmetic
   14-6 Clocks

[The columns giving the size of each difference and the sex making the better record are illegible in the scanned copy.]











